Abstract


Proposed is a fault-tolerant multiprocessor architecture that needs much
less redundant hardware than Modular Redundancy architectures. The
architecture uses weighted checksum techniques and is suited to linear
Digital Signal Processing applications in which multiple copies of an
identical processor are used to meet the throughput requirement. Single
fault detection/correction and multiple fault detection/correction
techniques are discussed. Also proposed are statistical fault
detection/correction algorithms for systems containing numerical roundoff
or truncation noise, such as fixed point or floating point systems.
Simulations of these algorithms are presented, as well as simulations of
numerical noise distributions in real fixed point system applications. Our
choice of weights reduces the dynamic range requirement of the checksum
processors and minimizes the masking of small faults by the numerical
noise. Efficient fault detection/correction algorithms for exact arithmetic
systems are presented, including one for residue arithmetic systems.
Practical architectures for implementing the single fault
detection/correction algorithm are also presented. These architectures are
designed to mask any single component failure in the system.
Chapter 1
Introduction 


In conventional fault-tolerant applications, multiple copies of the
processor are used to mask one or more faults. This technique is called the
Modular Redundancy Technique [Johnson 84, Nelson 82, Losq 76, de Sousa 78,
Wensley 78]. One of the most popular forms is Triple Modular Redundancy
[von Neumann 56, Johnson 84, Siewiorek 82], proposed by von Neumann in the
1950s, in which three identical processors and a majority voter are used to
mask a single fault, as shown in figure 1.1. The major advantages of Triple
Modular Redundancy are that it is simple to implement and that it can be
used for arbitrary applications. Its major disadvantage is that it requires
much excess hardware: two-thirds of the hardware is used for fault-tolerance
purposes.

Double Modular Redundancy, shown in figure 1.2, is an alternative in which
two copies of the processor are used to detect a single fault. In this
case, only half of the hardware is used for fault-tolerance purposes.
However, once a fault is detected, the system is left with the non-trivial,
time-consuming task of determining which processor is at fault. One way to
check which processor is faulty is to interrupt the processors and run
self-diagnostic programs. Although a “permanent” hardware fault can be
discovered this way, a “transient” fault cannot. One possible method of
handling a transient fault is simply to do the computation over again. If
the fault was transient, the results should agree the second time. In order
to do this, the processors have to keep a record of “check points”, where
the internal states of the processors are stored away at regular intervals,
so that when a fault is discovered the processors can be restarted from the
last check point.

Figure 1.1: Triple Modular Redundancy

There  are  a  number  of  other  alternatives  which  attempt  to  increase  the 
hardware  utilization  to  above  50  %.  One  example  is  the  self-testing  system 
[Johnson 84]. The idea is that the system occasionally suspends the current
job,  stores  the  internal  state,  and  runs  the  diagnostic  test  programs  which 
check  for  the  hardware  faults.  Another  example  is  the  Roving  Emulator 
system [Breuer 83]. In this system, only one part of the system is tested
for  the  fault  at  one  time.  The  fault  checker  checks  a  hardware  module 
by  emulating  its  function,  using  the  same  inputs  and  internal  states,  and 
comparing  the  output.  If  no  fault  is  detected  for  a  while,  the  fault  checker 
proceeds  to  check  another  hardware  module.  Although  these  systems  may 
utilize  greater  than  50%  of  hardware,  they  do  not  detect  or  mask  all  the 
faults,  and  it  may  take  a  while  to  detect  a  failure. 

The systems that achieve high reliability with little redundant hardware
are data transmission systems using error coding techniques. The idea
behind these systems is to encode the data with a reliable encoder,
transmit the encoded data using a slightly larger bandwidth than the
minimum bandwidth required by the non-encoded data, and then decode the
data at the receiving end with a reliable decoder. Even if some data are
destroyed in the transmission process, the decoder can reconstruct the data
as long as not too much of the data is destroyed. How much extra
transmission bandwidth is required depends on the noise model of the
transmission channel and the reliability requirement of the application.
Error coding techniques have also been used in memory systems. Although
error coded systems achieve high reliability with low hardware overhead,
they can only be used when the output is identical to the input.

Figure 1.2: Double Modular Redundancy

However, if we restrict the class of applications, it is possible to apply
techniques similar to the error coding technique to other systems in order
to achieve high reliability with little hardware overhead. Good examples of
reliable systems with little hardware overhead have been studied by Huang
and Abraham [Huang 82, Huang 84] and Jou [Jou 84, Jou 86]. They achieved
single fault correction for matrix operations in linear or mesh-connected
processor arrays using the weighted checksum approach. The input matrix is
encoded using the weighted checksum technique, processed in a processor
array containing slightly more processors than the non-fault-tolerant case,
and the output matrix is decoded, also using the weighted checksum
technique. However, their architecture is limited to matrix operations in
array processors, and their fault-tolerance algorithm only covers processor
calculation errors. It is assumed that there is no fault in the datapath
and that a faulty processor is still capable of passing the incoming data
on to the next processor without introducing an error. Their choice of
weights also requires the dynamic range of the “extra” processors to be
much greater than that of the original processors, and the numerical noise
from the “extra” processors heavily masks the errors from other processors
in non-exact arithmetic systems (fixed or floating point systems). They
also do not have effective fault detection/correction algorithms for
systems with numerical noise.

Proposed in this thesis is a more general fault-tolerant multiprocessor
architecture with low hardware overhead that can be used for any linear
signal processing purpose. It also uses weighted checksum techniques, and
its datapath is designed to tolerate any single point failure. Our choice
of weights reduces the dynamic range of the “extra” checksum processors and
improves the fault detection/correction capability in the presence of
numerical noise. We also introduce effective fault detection/correction
algorithms for systems with numerical noise as well as for exact arithmetic
systems, including residue arithmetic systems.

Our work is presented as follows. In chapter 2, we present the basic idea
behind the fault-tolerant multiprocessor architecture for DSP. In chapter
3, the fault detection/correction algorithms are presented for systems
containing numerical roundoff or truncation noise. In chapter 4, we present
the generalized likelihood ratio test for detection/correction of faults.
In chapter 5, simulations of these fault detection/correction algorithms
and of the numerical noise distribution in various systems are presented.
In chapter 6, we discuss the multiple fault detection/correction algorithms
in the presence of numerical noise. In chapter 7, we present efficient
fault detection/correction algorithms for exact arithmetic systems,
including residue number arithmetic systems. In chapter 8, we discuss
practical architectures suitable for implementing the fault-tolerant
architecture. Chapter 9 is the conclusion.
Chapter 2

Fault-tolerant DSP
Multiprocessor Systems


2.1  The  Checksum  Architecture 

2.1.1  Single  Fault  Detection 

Digital  Signal  Processing  systems  often  have  to  perform  linear  processing 
tasks  on  massive  amounts  of  incoming  data  at  a  very  rapid  rate,  often  in  real 
time.  Radar  and  sonar  systems  are  typical  of  such  applications.  Massive 
computational  requirements  often  lead  to  highly  parallel  multiprocessor 
architectures [Faithi 83]. In such multiprocessor environments, it is not
unusual  to  have  multiple  processors  doing  the  identical  linear  processing 
task.  Typical  of  linear  processing  tasks  are  filters  and  transforms. 

Here is an example of a simple multiprocessor DSP architecture that uses a
number of identical linear processors. Suppose a single processor is N
times slower than what the application requires. One can use N processors
on a rotating basis to meet the throughput requirement. The first segment
of the input data goes to the first processor, the second segment goes to
the second processor, and so on. By the time the Nth processor has received
its input, the first processor has output its results and is ready to
receive the first segment of the next batch of data.

In a multiprocessor architecture with N processors doing identical linear
tasks on different sets of input data, it is possible to use a technique
similar to the error coding technique [Kohavi 78, Siewiorek 82] in order to
achieve fault detection/correction. In data transmission systems, the data
to be transmitted is often error coded to spread the information over a
wider bandwidth. It is then sent over a transmission channel with a
slightly larger bandwidth than the minimum, and is decoded at the receiving
end. The reliability of the system can be greatly enhanced with relatively
little extra transmission bandwidth, provided that the encoders and
decoders are fault-free. Similar things can be done in the multiprocessor
environment. We would like to encode the input data and “spread” the
information over a wider database. We would then process these data with
slightly more processors than the minimum required, and decode the outputs
as shown in figure 2.1. The encoding and decoding would be such that the
desired number of processor faults can be detected or masked, provided that
the encoder and the decoder are fault-free.

Figure 2.1: Information Spreading for Fault Tolerance

The single fault detection version of such a system is shown in figure 2.2.
It consists of N linear processors processing different sets of input data.
Let $\underline{x}_k$ be the input data segment and $\underline{y}_k$ be the
output data segment of processor k for the current batch of data. There is
an extra (N+1)th processor in the system which is responsible for fault
detection. It is an identical processor doing the identical linear
processing as all other processors, and its input is the sum of all other
inputs.

\[
\underline{x}_{N+1} = \sum_{k=1}^{N} \underline{x}_k \qquad (2.1)
\]

We shall call the extra (N+1)th processor the checksum processor and all
other processors the data processors. In the absence of a fault, the output
of the checksum processor should be equal to the output checksum, which is
the sum of the outputs of the data processors. That means the syndrome
$\underline{s}_{N+1}$ of the (N+1)th checksum processor, defined as

\[
\underline{s}_{N+1} = \underline{y}_{N+1} - \sum_{k=1}^{N} \underline{y}_k \qquad (2.2)
\]

should be equal to zero when there is no fault. If there is a fault in the
system, it would not be equal to zero. Therefore, one can achieve single
fault detection with a single checksum processor.
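
The detection rule of equations (2.1) and (2.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix `F`, the input segments, and the injected fault are made up for the example, and exact integer arithmetic is used so that a fault-free syndrome is exactly zero.

```python
# Sketch of single fault detection with one checksum processor (eqs. 2.1-2.2).
# F, the inputs, and the injected fault are illustrative assumptions.

def apply_linear(F, x):
    """Multiply the q-by-p matrix F by the length-p vector x."""
    return [sum(f * v for f, v in zip(row, x)) for row in F]

def checksum_input(inputs):
    """Input to the checksum processor: elementwise sum of all data inputs."""
    return [sum(col) for col in zip(*inputs)]

def syndrome(checksum_output, data_outputs):
    """Checksum processor output minus the sum of the data processor outputs."""
    return [c - sum(col) for c, col in zip(checksum_output, zip(*data_outputs))]

F = [[1, 2, 0],
     [0, 1, -1]]                                # q = 2 outputs, p = 3 inputs
inputs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # N = 3 data processors

outputs = [apply_linear(F, x) for x in inputs]
check_out = apply_linear(F, checksum_input(inputs))
fault_free = syndrome(check_out, outputs)

# Inject a fault: corrupt one output sample of the second data processor.
faulty_outputs = [list(y) for y in outputs]
faulty_outputs[1][0] += 5
with_fault = syndrome(check_out, faulty_outputs)

print(fault_free)   # all zeros when no processor is faulty
print(with_fault)   # non-zero entry reveals the fault
```

A non-zero syndrome says that some processor failed, but not which one; locating the fault needs more than one checksum processor.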


2.1.2  Single  Fault  Correction 

The checksum processor is very much like a “parity” processor. In the error
coding technique with a parity bit, the parity bit is formed by modulo-2
addition of all the data bits or their complements. In the checksum
processor case, the input to the checksum processor is formed by adding the
inputs of the data processors.

Figure 2.2: Single Fault Detection Checksum Architecture

Treating the checksum processor like a parity processor, single error
correcting codes such as the Hamming Code can be applied directly to the
multiprocessor system. In error correcting codes, there are a number of
parity bits, each the modulo-2 sum of a different set of data bits. If
there is no error, all the parity bits match correctly. If one of the bits
is faulty, the faulty bit can be located by looking at which parity bits
mismatch. Figure 2.3 shows how the Hamming Code can be used for single
fault correction in the multiprocessor architecture to protect four data
processors using three checksum processors. Processors 1 through 4 are data
processors and processors 5 through 7 are the checksum processors. On the
right of the processors are their addresses, which are used for fault
location. Notice that the processor address is different from the processor
number. The lines on the left of the processors represent which data
processor inputs are used to form the inputs to each checksum processor.
That means

\[
\begin{aligned}
\underline{x}_5 &= \underline{x}_1 + \underline{x}_2 + \underline{x}_4 \\
\underline{x}_6 &= \underline{x}_1 + \underline{x}_3 + \underline{x}_4 \\
\underline{x}_7 &= \underline{x}_2 + \underline{x}_3 + \underline{x}_4
\end{aligned} \qquad (2.3)
\]

and

\[
\begin{aligned}
\underline{s}_5 &= \underline{y}_5 - \underline{y}_1 - \underline{y}_2 - \underline{y}_4 \\
\underline{s}_6 &= \underline{y}_6 - \underline{y}_1 - \underline{y}_3 - \underline{y}_4 \\
\underline{s}_7 &= \underline{y}_7 - \underline{y}_2 - \underline{y}_3 - \underline{y}_4
\end{aligned} \qquad (2.4)
\]

Let us define the checksum match variable $b_m$ to be equal to

\[
b_m = \begin{cases} 0 & \text{if } \underline{s}_m = \underline{0} \\ 1 & \text{if } \underline{s}_m \neq \underline{0} \end{cases} \qquad (2.5)
\]

The faulty processor is located from the codeword $(b_7\ b_6\ b_5)$. When
all the processors are working correctly, the codeword is (0 0 0). If one
of the processors is faulty, the binary value of the codeword is equal to
the address of the faulty processor.

If the faulty processor is a data processor, its correct output can be
calculated using the syndrome of one of the checksum processors that check
the faulty processor. If the kth data processor is faulty, and if it is
being checked by the mth checksum processor, the correct output
$\hat{\underline{y}}_k$ is equal to

\[
\hat{\underline{y}}_k = \underline{y}_k + \underline{s}_m \qquad (2.6)
\]

Note that when a checksum processor is detected as being at fault, it is
usually not necessary to correct it.
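
The location and correction steps above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes scalar inputs and outputs, uses a stand-in linear processor `process(x) = 3x`, and takes the checking pattern and addresses from equation (2.3) and figure 2.3. The helper names are hypothetical.

```python
# Sketch of Hamming-style fault location and correction for four data
# processors (1-4) and three checksum processors (5-7).
# Scalar data and process(x) = 3x are assumptions made for the example.

# Data processors checked by each checksum processor (equation 2.3).
CHECKS = {5: (1, 2, 4), 6: (1, 3, 4), 7: (2, 3, 4)}
# Processor addresses (b7 b6 b5) from figure 2.3, written as integers.
ADDRESS = {1: 3, 2: 5, 3: 6, 4: 7, 5: 1, 6: 2, 7: 4}

def process(x):
    return 3 * x                      # stand-in linear processor

def syndromes(y):
    """s_m = y_m minus the outputs of the data processors that m checks."""
    return {m: y[m] - sum(y[k] for k in ks) for m, ks in CHECKS.items()}

def locate(s):
    """Return the faulty processor number, or None if every syndrome is zero."""
    b = {m: int(s[m] != 0) for m in s}
    codeword = 4 * b[7] + 2 * b[6] + b[5]
    if codeword == 0:
        return None
    return next(p for p, addr in ADDRESS.items() if addr == codeword)

def correct(y, s, k):
    """Equation (2.6) for a faulty data processor k (k must be 1..4)."""
    m = next(m for m, ks in CHECKS.items() if k in ks)
    return y[k] + s[m]

x = {1: 10, 2: 20, 3: 30, 4: 40}
for m, ks in CHECKS.items():
    x[m] = sum(x[k] for k in ks)      # checksum processor inputs
y = {p: process(v) for p, v in x.items()}

y[3] += 7                             # inject a fault into data processor 3
s = syndromes(y)
faulty = locate(s)
fixed = correct(y, s, faulty)
print(faulty, fixed)
```

Here only the syndromes of processors 6 and 7 are non-zero, so the codeword is (1 1 0) = 6, which is the address of data processor 3.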

Using this scheme, the number of checksum processors C and the number of
data processors N required to achieve single fault correction are related
by

\[
N + C \le 2^C - 1 \qquad (2.7)
\]

This formula comes from the fact that there are $2^C$ possible values for
the codeword $(b_{N+C}\ b_{N+C-1}\ \cdots\ b_{N+1})$, which has to be able
to represent N + C possible single processor failure modes and one no-fault
mode.

Unfortunately, the error coding technique cannot be used for the multiple
fault detection/correction cases. This is because the error coding
techniques are based on modulo-2 arithmetic.


2.2  The  Weighted  Checksum  Architecture 

2.2.1  The  Use  of  Weighted  Checksums 

Processor    Address (b7 b6 b5)
1            (0 1 1) = 3
2            (1 0 1) = 5
3            (1 1 0) = 6
4            (1 1 1) = 7
5            (0 0 1) = 1
6            (0 1 0) = 2
7            (1 0 0) = 4

Figure 2.3: Single Fault Correction Checksum Architecture

Figure 2.4: Weighted Checksum Architecture

The checksum technique of the previous section is a special case of the
weighted checksum technique. The weighted checksum technique requires fewer
checksum processors than the simple checksum technique and can handle
multiple fault detection/correction cases. In this case, the inputs to the
checksum processors are formed by taking linear combinations of the data
processor inputs. Suppose there are C checksum processors identical to the
data processors, as shown in figure 2.4. Then the inputs to those checksum
processors are

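\[
\underline{x}_k = \sum_{m=1}^{N} w_{k,m}\, \underline{x}_m \quad \text{for } k = N+1, \ldots, N+C \qquad (2.8)
\]

where $w_{k,m}$ are the scalar weights.

Let F be the linear function that the processors perform.

\[
\underline{y}_k = F \underline{x}_k \quad \text{for } k = 1, \ldots, N \qquad (2.9)
\]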
If we assume that the input has p data points and the output has q data
points, then F is a q by p matrix. The output of the checksum processors
can now be written as

\[
\underline{y}_k = F \underline{x}_k = \sum_{m=1}^{N} w_{k,m}\, F \underline{x}_m \quad \text{for } k = N+1, \ldots, N+C \qquad (2.10)
\]

The syndrome is defined as the difference between the checksum processor
output and the corresponding output checksum.

\[
\underline{s}_k = \underline{y}_k - \sum_{m=1}^{N} w_{k,m}\, \underline{y}_m \quad \text{for } k = N+1, \ldots, N+C \qquad (2.11)
\]

If there is no fault in the system, all the $\underline{s}_k$ should be
equal to zero. When there are faults in the system, let us assume that the
fault in processor k introduces the error $\underline{e}_k$ on to the
output $\underline{y}_k$.

\[
\underline{y}_k = F \underline{x}_k + \underline{e}_k \quad \text{for } k = 1, \ldots, N+C \qquad (2.12)
\]

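The syndromes are now equal to

\[
\underline{s}_k = \underline{e}_k - \sum_{m=1}^{N} w_{k,m}\, \underline{e}_m \quad \text{for } k = N+1, \ldots, N+C \qquad (2.13)
\]

We can simplify this equation by defining the weights for the checksum
processors to be

\[
w_{N+i,N+j} = \begin{cases} -1 & \text{for } i = j = 1, \ldots, C \\ 0 & \text{for } i \neq j \end{cases} \qquad (2.14)
\]

Now, the syndrome can be written as

\[
\underline{s}_k = -\sum_{m=1}^{N+C} w_{k,m}\, \underline{e}_m \quad \text{for } k = N+1, \ldots, N+C \qquad (2.15)
\]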
We can expand our fault detection/correction algorithms to include faults
in the input checksum calculations and the syndrome calculations, provided
that each checksum processor has a separate hardware module for computing
its input checksum and syndrome. A fault in the input checksum calculation
or the syndrome calculation can be treated as if the fault were in the
corresponding checksum processor. A fault occurring while calculating the
input checksum causes the input checksum to deviate from the correct value
by the error $\underline{\delta}_k$.

\[
\underline{x}_k = \sum_{m=1}^{N} w_{k,m}\, \underline{x}_m + \underline{\delta}_k \quad \text{for } k = N+1, \ldots, N+C \qquad (2.16)
\]

A fault occurring while calculating the syndrome introduces the error
$\underline{\lambda}_k$ on to the syndrome.

\[
\underline{s}_k = \underline{y}_k - \sum_{m=1}^{N} w_{k,m}\, \underline{y}_m + \underline{\lambda}_k \quad \text{for } k = N+1, \ldots, N+C \qquad (2.17)
\]

The effect of all three types of faults on the syndrome is

\[
\underline{s}_k = \underline{e}_k + F \underline{\delta}_k + \underline{\lambda}_k - \sum_{m=1}^{N} w_{k,m}\, \underline{e}_m \quad \text{for } k = N+1, \ldots, N+C \qquad (2.18)
\]


Notice that the effects of the input checksum error $\underline{\delta}_k$
and the syndrome calculation error $\underline{\lambda}_k$ are
indistinguishable from the checksum processor error $\underline{e}_k$ as far
as their effects on the syndromes are concerned. They only affect the kth
syndrome, while a data processor error affects all the syndromes with
non-zero weights. Therefore, we shall treat the input checksum or the
syndrome calculation error as part of the checksum processor error. From
now on, when we refer to the checksum processor error, it shall
automatically include the input checksum and the syndrome calculation
errors. Let us define the composite error as

\[
\underline{\phi}_k = \begin{cases} \underline{e}_k & \text{for } k = 1, \ldots, N \\ \underline{e}_k + F \underline{\delta}_k + \underline{\lambda}_k & \text{for } k = N+1, \ldots, N+C \end{cases} \qquad (2.19)
\]
The syndrome and the composite error are related by

\[
\underline{s}_k = -\sum_{m=1}^{N+C} w_{k,m}\, \underline{\phi}_m \quad \text{for } k = N+1, \ldots, N+C \qquad (2.20)
\]
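
A small numerical check can make the claim about the three fault types concrete. The sketch below is illustrative only: `F`, the weights, the inputs, and the injected errors are assumptions, and the function names are hypothetical. It computes the syndromes directly from the definitions, with an optional input checksum error (`delta`) and syndrome calculation error (`lam`) on each checksum processor.

```python
# Sketch: input checksum errors and syndrome calculation errors affect only
# the syndrome of their own checksum processor.
# F, W_DATA, the inputs, and the injected errors are illustrative assumptions.

def matvec(F, x):
    return [sum(f * v for f, v in zip(row, x)) for row in F]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def scale(c, a):
    return [c * p for p in a]

def weighted_sum(weights, vectors):
    total = [0] * len(vectors[0])
    for w, v in zip(weights, vectors):
        total = add(total, scale(w, v))
    return total

def syndromes(F, W_data, x, delta, lam):
    """Syndromes with input checksum errors delta[i] and syndrome errors lam[i]."""
    y = [matvec(F, xm) for xm in x]                          # data processors
    s = []
    for i, weights in enumerate(W_data):
        x_check = add(weighted_sum(weights, x), delta[i])    # faulty input checksum
        y_check = matvec(F, x_check)                         # checksum processor
        out_check = weighted_sum(weights, y)                 # output checksum
        s.append(add(add(y_check, scale(-1, out_check)), lam[i]))
    return s

F = [[1, 2], [3, -1]]
W_DATA = [[1, 1], [1, -1]]          # rows: checksum processors; columns: data processors
x = [[1, 2], [3, 4]]
zero = [0, 0]

s_delta = syndromes(F, W_DATA, x, delta=[[1, 0], zero], lam=[zero, zero])
s_lam = syndromes(F, W_DATA, x, delta=[zero, zero], lam=[zero, [0, 2]])
print(s_delta)   # first syndrome equals F applied to the input checksum error
print(s_lam)     # second syndrome equals the syndrome calculation error
```

In both runs the other syndrome stays at zero, which is why these errors can be folded into the error of the corresponding checksum processor.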

2.2.2  Single  Fault  Correction 

Let us examine how a single fault can be detected and corrected using
weighted checksums. When all the processors are working correctly, all the
syndromes should be equal to zero. When only processor k is faulty with the
error $\underline{\phi}_k$, the syndromes would be

\[
\begin{aligned}
\underline{s}_{N+1} &= -w_{N+1,k}\, \underline{\phi}_k \\
\underline{s}_{N+2} &= -w_{N+2,k}\, \underline{\phi}_k \\
&\ \ \vdots \\
\underline{s}_{N+C} &= -w_{N+C,k}\, \underline{\phi}_k
\end{aligned} \qquad (2.21)
\]

Let us define the weight vector to be

\[
\underline{w}_k^T = (w_{N+1,k},\ w_{N+2,k},\ \ldots,\ w_{N+C,k}) \qquad (2.22)
\]

Given the $\underline{s}_k$'s, one can distinguish a processor $k_1$
failure from a processor $k_2$ failure if $\underline{w}_{k_1}$ is linearly
independent of $\underline{w}_{k_2}$. That means the $\underline{w}_k$'s
have to be pairwise linearly independent in order to have single fault
location. In order for the $\underline{w}_k$'s to be linearly independent,
one needs at least two syndromes (i.e. two checksum processors).

Let us consider a very simple example with two data processors and two
checksum processors (i.e. N = 2 and C = 2). The weight vectors used are
$\underline{w}_1^T = (1, 1)$ and $\underline{w}_2^T = (1, -1)$. In figure
2.5, all the possible values for the syndromes are drawn in the syndrome
plane. When processor k is faulty, the syndrome falls on the line along the
vector $\underline{w}_k$, with its distance from the origin proportional to
the size of the error $\phi_k$.

Once the faulty processor has been located, its correct output can be
calculated from its faulty output and any one of the syndromes with
non-zero weight. If the kth data processor is faulty, its correct output is
equal to

\[
\hat{\underline{y}}_k = \underline{y}_k + \frac{1}{w_{m,k}}\, \underline{s}_m \qquad (2.23)
\]

where m is any checksum processor with $w_{m,k} \neq 0$. Note that when a
checksum processor is detected as being faulty, it is not necessary to
correct its output.

A special case of single fault correction is when there is one data
processor and two checksum processors (N = 1, C = 2) with the data
processor weight vector $\underline{w}_1^T = (1, 1)$. This means that all
three processors have the same input, which is equivalent to the Triple
Modular Redundancy technique.
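
The N = 2, C = 2 example can be sketched in Python as follows. This is a hedged illustration rather than the thesis algorithm: it assumes scalar outputs and exact arithmetic, a stand-in processor that doubles its input, and a parallelism test (zero 2-D cross product) that only applies when C = 2. All function names are hypothetical.

```python
from fractions import Fraction

# Sketch of weighted-checksum single fault location and correction for
# N = 2 data processors and C = 2 checksum processors.
# Scalar data, exact arithmetic, and F(x) = 2x are assumptions.

N, C = 2, 2
# W_DATA[i][k]: weight of data processor k in checksum processor i.
W_DATA = [[1, 1],
          [1, -1]]

def weight_vector(k):
    """Weight vector of processor k (0-based); a checksum processor has -1 in its own row."""
    if k < N:
        return [W_DATA[i][k] for i in range(C)]
    return [-1 if i == k - N else 0 for i in range(C)]

def syndromes(y):
    """Checksum output i minus the weighted sum of the data outputs."""
    return [y[N + i] - sum(W_DATA[i][m] * y[m] for m in range(N)) for i in range(C)]

def locate(s):
    """Index of the processor whose weight vector is parallel to s, or None."""
    if all(v == 0 for v in s):
        return None
    for k in range(N + C):
        w = weight_vector(k)
        if w[0] * s[1] - w[1] * s[0] == 0:   # 2-D cross product: valid for C = 2 only
            return k
    raise ValueError("syndrome matches no single-fault pattern")

def correct(y, s, k):
    """Equation (2.23) for a faulty data processor k, using any row with non-zero weight."""
    i = next(i for i in range(C) if W_DATA[i][k] != 0)
    return y[k] + Fraction(s[i], W_DATA[i][k])

x = [3, 5]
x += [sum(W_DATA[i][m] * x[m] for m in range(N)) for i in range(C)]
y = [2 * v for v in x]                       # fault-free outputs

y[1] += 4                                    # inject a fault into the second data processor
s = syndromes(y)
k = locate(s)
print(s, k, correct(y, s, k))
```

The syndrome lands on the line through (1, -1), which identifies the second data processor, and equation (2.23) restores its fault-free output.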


2.2.3  Block  Matrix  Notation 

The equations for the syndrome values in the presence of a fault can be
written in a convenient block vector¹ notation. At this point, we shall
define the syndrome block vector $\underline{s}$ and the composite error
block vector $\underline{\phi}$ to be

\[
\underline{s} = \begin{bmatrix} \underline{s}_{N+1} \\ \vdots \\ \underline{s}_{N+C} \end{bmatrix}, \qquad
\underline{\phi} = \begin{bmatrix} \underline{\phi}_1 \\ \vdots \\ \underline{\phi}_{N+C} \end{bmatrix} \qquad (2.24)
\]

We also define the block weighting matrix W,

\[
W = \begin{bmatrix}
w_{N+1,1} I & w_{N+1,2} I & \cdots & w_{N+1,N} I & -I & 0 & \cdots & 0 \\
w_{N+2,1} I & w_{N+2,2} I & \cdots & w_{N+2,N} I & 0 & -I & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \ddots & \vdots \\
w_{N+C,1} I & w_{N+C,2} I & \cdots & w_{N+C,N} I & 0 & \cdots & 0 & -I
\end{bmatrix} \qquad (2.25)
\]

where I is a q by q identity matrix. With this block matrix notation, the
syndromes can simply be written as

\[
\underline{s} = -W \underline{\phi} \qquad (2.26)
\]

Let us also define $W_k$ to be the kth block column of W. Then in case of
the kth processor failure, the syndrome will be:

\[
\underline{s} = -W_k\, \underline{\phi}_k \qquad (2.27)
\]

¹ Block vectors (matrices) are ordinary vectors (matrices) whose elements
are themselves vectors (matrices).

Figure 2.5: Possible Single Fault Syndrome Values


2.2.4  Multiple  Fault  Detection/Correction 

Let us examine what happens when there are two faulty processors in the
single fault correction system with two checksum processors. The value of
the syndrome $\underline{s}$ would be a linear combination of the two
weight vectors corresponding to the faulty processors. Since any two weight
vectors are linearly independent, the syndrome cannot be at the origin of
the syndrome space. Thus, the presence of up to two faults can be
recognized by the syndrome not being at the origin. However, one cannot
distinguish which processors have failed, since the non-zero syndrome can
be explained by a linear combination of any two processor failures. Also,
one cannot always tell whether one processor has failed or two, since a
linear combination of two processor failures can lie on another processor's
weight vector line. When there are three faulty processors, it is possible
that the three processor errors, multiplied by their weights and added,
form all-zero syndromes. Therefore, one cannot reliably detect three
failures with two checksum processors.

In general, to be able to detect $L_m$ failures, every possible set of
$L_m$ weight vectors has to be linearly independent, so that any
combination of up to $L_m$ failures produces non-zero syndromes. This
requires the weight vectors to be at least $L_m$-dimensional, which means
that one needs at least $L_m$ checksum processors. In order to be able to
detect and correct up to $K_m$ failures, every possible set of $2K_m$
weight vectors has to be linearly independent. This is because the syndrome
hyperplane defined by all possible linear combinations of one set of $K_m$
weight vectors should not intersect another hyperplane defined by all
possible linear combinations of another set of $K_m$ weight vectors, except
at the origin. One needs at least a $2K_m$-dimensional syndrome space to
achieve this, which requires at least $2K_m$ checksum processors. If one is
interested in detecting up to $K_m + L_m$ failures and correcting failures
if $K_m$ or fewer failures have occurred, one needs a minimum of
$2K_m + L_m$ checksum processors, as proven in appendix A.

\[
C \ge 2K_m + L_m \qquad (2.28)
\]

2.2.5  Choice  of  Weight  Vectors  for  Single  Fault  Correction

The weight vectors for single fault correction should be chosen to minimize
the computational effort involved in calculating the input checksums and
the syndromes, as well as the number of checksum processors needed to
protect a given number of data processors. Jou and Abraham [Jou 84] used
$w_{N+k,m} = 2^{(k-1)(m-1)}$ as weights for the data processors. These
weights have the advantage that multiplication by a power of 2 can be done
with simple bit-shifts. However, the dynamic range of the checksum
processor registers has to be much greater than that of the data processors
in order to accommodate $w_{N+C,N} = 2^{(C-1)(N-1)}$. There are also
disadvantages in having weights that vary widely in size. For example, in
the presence of numerical computation noise, the numerical noise generated
by the processors with large weights would mask the output signal of the
processors with small weights in the syndrome calculations, as we shall see
in later sections.

One choice of weights that simplifies the checksum computation is to use
only 0 and 1 as weights. This eliminates the multiplications in the
checksum computations and is equivalent to using simple checksums instead
of weighted checksums. It is also equivalent to using the conventional
error coding techniques. Using these weights, there can be up to
$N + C = 2^C - 1$ weight vectors for a given number of checksum processors
C. For example, we can have four data processors with three checksum
processors (C = 3, N = 4). The W matrix in this case with q = 1 (q = 1 so
that we do not have to write in the block matrix form with the identity
matrix I) would be

\[
W = \begin{bmatrix}
1 & 1 & 0 & 1 & -1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & -1 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & -1
\end{bmatrix} \qquad (2.29)
\]


The N + C = 7 weight vectors in this case are seen as the block columns of
the weight matrix W. Recall that the case with one data processor and two
checksum processors (N = 1, C = 2) is equivalent to Triple Modular
Redundancy. The weight matrix in that case is

\[
W = \begin{bmatrix}
1 & -1 & 0 \\
1 & 0 & -1
\end{bmatrix} \qquad (2.30)
\]


If only 0, 1, and -1 are allowed as weights, we still do not need
multiplications in the check-summing process, but there can be more data
processors for a given number of checksum processors than when only 0 and 1
are used as weights. There can be up to $N + C = (3^C - 1)/2$ weight
vectors for a given C. For example, we can have ten data processors with
three checksum processors to achieve single fault correction (N = 10,
C = 3). The W matrix in such a case with q = 1 would be:

\[
W = \begin{bmatrix}
1 & 1 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & -1 & 0 & 0 \\
1 & -1 & 1 & 1 & 0 & 0 & 1 & 1 & -1 & -1 & 0 & -1 & 0 \\
0 & 0 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 0 & 0 & -1
\end{bmatrix} \qquad (2.31)
\]


Another choice of weights for single fault correction is to use small
integers. This keeps the multiplications in the checksum computations
relatively easy and does not increase the dynamic range of the checksum
processors nearly as much as Huang's choices for weights. A good way to
get the single fault correction weight vector set is to start with all
possible combinations of C weights, each between -L and +L, and to
eliminate the zero vector and any vector which is linearly dependent on
another vector. If weights between -2 and +2 are used, weight
multiplication requires no more than a shift operation, and we can have
$N + C = (5^C - 3^C)/2$ different weight vectors for a given C. For
example, with two checksum processors, we can have up to six data
processors (N = 6, C = 2), with the following weight matrix:


\[
W = \left[\begin{array}{*{8}{r}}
1 & 1 & 1 & 1 & 2 & 2 & -1 & 0 \\
1 & -1 & 2 & -2 & 1 & -1 & 0 & -1
\end{array}\right] \tag{2.32}
\]


Table 2.1 shows the relationship between N and C depending on the
range of weights used. Notice that one can protect a large number of data
processors using relatively few checksum processors and using only very
small integers as weights.

Weights Used          0, 1     0, ±1     0, ±1, ±2
Max N for C = 2          1         2             6
Max N for C = 3          4        10            54
Max N for C = 4         11        36           268
Max N for C = 5         25       116          1436

Table 2.1: The Maximum N for the Weights Used
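
The counting rule described above (drop the zero vector, keep one vector per set of linearly dependent vectors) can be checked by brute force. The following is only an illustrative sketch, not part of the proposed architecture; the function name `count_weight_vectors` is made up for this example, and it assumes integer weights.

```python
from functools import reduce
from itertools import product
from math import gcd


def count_weight_vectors(C, weights):
    """Count usable weight vectors of length C over the given weight values.

    The zero vector is dropped and only one vector is kept per line through
    the origin, so no two counted vectors are scalar multiples of each other.
    """
    lines = set()
    for v in product(weights, repeat=C):
        if not any(v):
            continue  # skip the zero vector
        g = reduce(gcd, (abs(x) for x in v))   # common factor of the entries
        first = next(x for x in v if x != 0)
        sign = 1 if first > 0 else -1          # make first non-zero entry positive
        lines.add(tuple(sign * x // g for x in v))
    return len(lines)


for C in (2, 3, 4):
    total = count_weight_vectors(C, (-1, 0, 1))
    print(C, total, total - C)  # C, N + C, maximum N
```

For weights 0, ±1 the enumeration reproduces $(3^C - 1)/2$, and for weights 0, ±1, ±2 it reproduces $(5^C - 3^C)/2$. Note that the latter gives N + C = 49 (N = 46) for C = 3, and that weights 0, 1 give 31 vectors (N = 26) for C = 5, so the entries 54 and 25 of Table 2.1 as printed may deserve a second look.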

We define a “complete” set of weight vectors as one that meets the
following condition: if $(a_1, a_2, \ldots, a_C)^T$ is a weight vector,
then every $(\pm a_1, \pm a_2, \ldots, \pm a_C)^T$ must also be a weight
vector, or else its negative must be a weight vector. The complete set of
weight vectors is created from the symmetric range of integers from -L to
+L. The weight matrix W for a complete set of weight vectors has the
property that $WW^T$ is a scaled identity matrix ($WW^T = \sigma_W^2 I$,
where $\sigma_W^2$ is a scalar), as proven in Appendix F. This property
becomes useful in later sections.
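
As a quick numerical illustration of this property (the proof itself is in Appendix F), the sketch below forms $WW^T$ for the thirteen-column weight matrix of (2.31), which uses weights 0, ±1 with C = 3. The helper name `gram` is an assumption of this example.

```python
def gram(W):
    """Return W W^T for a real matrix W given as a list of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in W] for r1 in W]


# Weight matrix for N = 10, C = 3 with weights 0, +1, -1.
# The last three columns belong to the checksum processors.
W = [
    [1,  1, 0,  0, 1,  1, 1,  1,  1,  1, -1,  0,  0],
    [1, -1, 1,  1, 0,  0, 1,  1, -1, -1,  0, -1,  0],
    [0,  0, 1, -1, 1, -1, 1, -1,  1, -1,  0,  0, -1],
]

G = gram(W)
for row in G:
    print(row)
```

Every off-diagonal entry cancels and every diagonal entry equals 9, so for this matrix $WW^T = 9I$.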

If complex arithmetic is used in the processors, we can have weights
of the form n + jm, where n and m are small integers. Because the weights
have twice the dimensionality of the non-complex case, we can have more
data processors with the same number of checksum processors than in the
non-complex case. For example, if we use -1, 0, and +1 for n and m,
we can have $N + C = (9^C - 5^C)/4$ weight vectors. With two
checksum processors, we can protect up to twelve data processors (N = 12,
C = 2). This is many more than in the non-complex case with -1, 0, and
+1 as weights, in which only two data processors can be protected with
two checksum processors. The weight matrix in this case is:


\[
W = \left[\begin{array}{*{14}{c}}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1+j & 1+j & 1+j & 1+j & -1 & 0 \\
1 & j & -1 & -j & 1+j & -1+j & -1-j & 1-j & 1 & j & -1 & -j & 0 & -1
\end{array}\right] \tag{2.33}
\]

If three checksum processors were used, we could protect up to 148 data
processors (N = 148, C = 3). The “complete” set of complex weight vectors
of this form has the property that $WW^H = \sigma_W^2 I$, where
$\sigma_W^2$ is a scalar and $W^H$ is the Hermitian (complex conjugate
transpose) of W.

2.2.6  Reliability

A system failure in the single fault detection or correction system occurs
when there are two or more processor failures at the same time. Let us
assume for convenience that only the processors can fail and that all other
parts of the system are fault-free. Let $P_f$ be the single processor
failure rate. If we assume that $P_f$ is much less than one ($P_f \ll 1$)
and that the processor failures occur independently of each other, the
system failure rate would be approximately equal to

\[
\mathrm{Prob}(> 1 \text{ failure}) \approx \frac{(N+C)(N+C-1)}{2} P_f^2 \tag{2.34}
\]


This is an approximation, since the checksum processor failure rate would
be slightly higher than $P_f$ because it includes the failure rates of the
input checksum and the syndrome calculation. Let us compare this failure
rate to that of a Triple Modular Redundancy (TMR) system in which each of
the N processors is triplicated for single fault correction. System
failure in this case occurs when two or more processors fail at the same
time in one or more of the N triplicated processor groups.

\[
\mathrm{Prob}(> 1 \text{ failure in any triple}) \approx 3 N P_f^2 \tag{2.35}
\]

The system failure rate in our architecture is approximately (N + 2C)/6
times higher than the TMR system failure rate (the ratio of the two
expressions above is $(N+C)(N+C-1)/(6N)$), but the number of processors
used, N + C, is significantly less than the 3N of the TMR system.
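
To get a feel for this tradeoff, the two approximations (2.34) and (2.35) can be evaluated side by side. This is a minimal sketch; the function names and the values N = 10, C = 3, and $P_f = 10^{-4}$ are assumptions chosen only for illustration.

```python
def checksum_failure_rate(N, C, Pf):
    """Approximate failure rate of the checksum scheme, eq. (2.34)."""
    n = N + C
    return n * (n - 1) / 2 * Pf ** 2


def tmr_failure_rate(N, Pf):
    """Approximate failure rate of N triplicated processors, eq. (2.35)."""
    return 3 * N * Pf ** 2


N, C, Pf = 10, 3, 1e-4  # illustrative values only
ratio = checksum_failure_rate(N, C, Pf) / tmr_failure_rate(N, Pf)
print("failure rate ratio:", ratio)
print("processors used:", N + C, "versus", 3 * N)
```

With these numbers the checksum scheme fails about 2.6 times as often as TMR under the stated assumptions, while using 13 processors instead of 30.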

The actual system failure rate of our system would be higher, since we
have accounted only for the processor failure rate. We also have to account
for the failure rate of any other parts of the system, such as the control,
busses, the decoder which is responsible for locating the faulty processor
from the syndrome and correcting its output, and so forth. The actual
system failure rate $P_{sys}$ can be written as

\[
P_{sys} \approx \frac{(N+C)(N+C-1)}{2} P_f^2 + \sum_m P_m \tag{2.36}
\]


where $P_m$ is the failure rate of the m-th module of the system (not
including the processors and the input checksum and the syndrome
calculators). Notice the first term in the equation has the form $P_f^2$,
whereas the rest of the terms have the form $P_m$. The square term is
introduced by the single fault detection or correction of the processors.
If any of the $P_m$ were much greater than $P_f$, then the system failure
rate would be dominated by that term, and much of the effort that went
into making the system fault-tolerant would be wasted. Therefore, any
hardware module whose failure rate is significantly greater than $P_f$
should be protected by Triple Modular Redundancy or by other methods. If
triple modular redundancy were used in some of the modules, the system
failure rate would be equal to

\[
P_{sys} \approx \frac{(N+C)(N+C-1)}{2} P_f^2 + \sum_m P_m + \sum_l 3 P_l^2 \tag{2.37}
\]

where $P_m$ is equal to the failure rate of the non-triplicated hardware
modules, and $P_l$ is equal to the failure rate of the triplicated
hardware modules.


Chapter  3 

Numerical  Noise 


3.1  The  Effects  of  Numerical  Noise 

If  only  integer  arithmetic  were  used  in  the  system,  all  the  calculations  would 
be  exact  including  the  syndrome  calculation.  This  makes  it  relatively  trivial 
to  detect  and/or  correct  a  single  fault  from  the  syndromes.  In  single  fault 
detection,  the  syndrome  would  be  exactly  zero  when  there  is  no  fault  in  the 
system,  and  non-zero  when  there  is  a  fault.  In  locating  the  faulty  processor, 
the  non-zero  syndrome  would  lie  exactly  on  the  corresponding  weight  vector 
line  in  case  of  a  single  fault. 

However,  if  fixed  point  or  floating  point  arithmetic  were  used,  there 
would  be  roundoff  or  truncation  noise  included  in  the  processor  outputs  as 
well  as  in  the  input  checksums  and  the  syndromes.  This  numerical  noise 
causes  the  syndrome  to  deviate  slightly  from  zero  even  when  there  is  no 
fault  in  the  system.  In  cases  when  there  is  a  faulty  processor  in  the  system, 
the  syndrome  no  longer  lies  exactly  on  the  corresponding  weight  vector  line, 
but  may  deviate  from  it  slightly.  We  need  to  develop  reasonable  methods 
to  sort  out  the  small  numerical  noise  in  the  syndromes  from  the  effects  of 
the  faulty  processor,  and  be  able  to  make  reasonable  decisions  as  to  which 
processor  has  failed. 

A reasonable thing to do is to model the numerical noise as a statistical
process, and attempt to make the system fault diagnosis based on the
statistical noise model. However, if the fault diagnosis is based on a
statistical noise model, any method that we devise for the fault diagnosis
cannot be correct all the time. There are many design criteria for the
fault diagnosis method. Obviously, we would like the diagnosis to be as
accurate as possible, as often as possible. More importantly, we want to
design the system fault diagnosis method so that a false diagnosis would
not have any detrimental effects on the system application.


3.2  Single  Fault  Detection 

3.2.1  The  Threshold  Method 

For  the  single  fault  detection  system  with  one  checksum  processor,  we  have 
developed  the  following  threshold  method  for  detecting  a  fault.  The  idea 
behind  this  detection  method  is  that  the  small  non-zero  syndrome  is  likely 
caused  by  numerical  noise  and  the  large  non-zero  syndrome  is  likely  caused 
by  a  fault  within  the  system.  Let  us  assume  that  the  expected  mean  of  the 
numerical  noise  in  the  syndrome  is  equal  to  zero.  If  the  numerical  noise 
has  a  non-zero  expected  mean  (such  as  one  caused  by  some  truncation 
methods),  the  mean  value  can  be  subtracted  from  the  syndrome  to  yield 
the  zero  mean  syndrome. 

Let us start by examining the case when the fault detection is carried
out for each point of the syndrome (i.e. as if q = 1). When the syndrome
magnitude exceeds a certain preset threshold value $\gamma$, the system
would be diagnosed as containing a fault.

\[
\text{If } |s| > \gamma \text{ then a fault is detected} \tag{3.1}
\]

The threshold value $\gamma$ can be set in a number of ways. For example,
one can set it to be the maximum possible noise magnitude. This would
prevent mistakenly declaring a failure when the non-zero syndrome is
actually caused by computational noise. However, the system would not be
able to detect small processor errors with such a large threshold.
Furthermore, it is highly unlikely that the numerical noise magnitude in
the syndrome would be even close to the maximum noise magnitude. A more
reasonable threshold can be calculated by examining the probability
distribution of the syndrome when no hardware fault has occurred (this
distribution might be derived by modeling the numerical noise, or by
running experiments). If the probability distribution of the fault is also
known, it is possible to set the threshold so that the threshold test
gives the most likely explanation for the observed syndrome. However, the
failure mechanisms are difficult to model accurately. One reasonable
criterion for setting the threshold is to fix the false alarm rate at a
desired value. The false alarm rate is equal to the probability that the
syndrome magnitude exceeds $\gamma$ under the no-fault condition, and is
thus the integral of the probability distribution of the noisy syndrome
over the range $|s| > \gamma$.

For the case when q > 1, the threshold testing can be done on the energy
in the syndrome, which is equal to the sum of the squares of the syndrome
data points.

\[
\text{If } s_{N+1}^T s_{N+1} > \gamma \text{ then a fault is detected} \tag{3.2}
\]

Again, the threshold value needed for the desired false alarm rate can be
calculated from the probability distribution of the syndrome energy
$s_{N+1}^T s_{N+1}$ under the no-fault condition.
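
The energy test and the fixed false alarm rate criterion can be sketched as follows. This is only an illustration under the assumption that the no-fault energy distribution is estimated from experiments, so the threshold is taken as an empirical quantile of recorded fault-free syndrome energies; the function names are hypothetical.

```python
def syndrome_energy(s):
    """Sum of squares of the q syndrome data points."""
    return sum(x * x for x in s)


def fault_detected(s, gamma):
    """Energy threshold test: declare a fault when the energy exceeds gamma."""
    return syndrome_energy(s) > gamma


def threshold_for_false_alarm(no_fault_energies, rate):
    """Pick gamma so that roughly `rate` of the recorded no-fault energies exceed it."""
    ordered = sorted(no_fault_energies)
    n_exceed = int(round(rate * len(ordered)))   # samples allowed above gamma
    idx = max(len(ordered) - n_exceed - 1, 0)
    return ordered[idx]


# Hypothetical fault-free energies, standing in for experimental data.
energies = [float(i) for i in range(1, 101)]
gamma = threshold_for_false_alarm(energies, 0.05)
print(gamma, sum(e > gamma for e in energies))
```

A model-based threshold (for example, one derived from an assumed noise distribution) could be substituted for the empirical quantile without changing the test itself.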

The output batch size q plays an important role in determining whether
the system is better at detecting a small transient fault or a small
permanent fault. A transient fault usually affects few data points within
the batch, and a permanent fault usually affects many data points within
the batch. The mean of the syndrome energy $s_{N+1}^T s_{N+1}$ under the
no-fault condition grows as O(q), whereas the standard deviation grows as
$O(\sqrt{q})$. Therefore, with larger q, the system is more likely to
detect a small permanent failure affecting many of the output data points.
However, it is less likely to detect a transient failure affecting only a
few of the output data points, since even a relatively large transient
failure may not increase the average syndrome energy enough to set off the
detection. Of course, one has the option of testing the syndrome in both
ways, using a large q and a small q, at the expense of more computation.
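
The growth rates quoted above can be made concrete with a small worked example. It assumes, purely for illustration, that the no-fault syndrome points are independent zero-mean Gaussian samples with a common variance $\sigma_s^2$ (a symbol introduced only for this example); real roundoff noise need not be Gaussian.

```latex
% Assumption: under no fault the q syndrome points s_1, ..., s_q are
% independent zero-mean Gaussian samples with variance \sigma_s^2.
\begin{align*}
E &= \sum_{i=1}^{q} s_i^2, \\
\mathrm{mean}(E) &= q\,\sigma_s^2, \\
\mathrm{std}(E) &= \sigma_s^2 \sqrt{2q}.
\end{align*}
% A permanent offset e on every point raises the mean energy by q e^2,
% which eventually outgrows the O(\sqrt{q}) spread of the noise energy.
% A transient offset e on a single point raises it by only e^2, while the
% threshold needed for a fixed false alarm rate grows roughly like q.
```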

3.2.2  Effects  of  Incorrect  Fault  Diagnosis 

There are two possible false fault diagnoses in the single fault detection
system. One is the false alarm case, in which large numerical noise sets
off the fault detection mechanism, and the other is when a fault is very
small and escapes detection. There is a tradeoff between these two kinds
of fault misdiagnosis. If the threshold value is large, there is a low
probability of a false alarm caused by the numerical noise, but there is a
high probability that a small fault may escape detection. If the threshold
value is small, the false alarm probability is high, but there is less
chance of a small fault escaping detection.

How  the  false  alarm  affects  the  system  performance  depends  much  on 
how  the  system  handles  an  occurrence  of  a  fault.  If  it  is  desirable  to  avoid 
the  false  alarm,  the  threshold  value  should  be  set  sufficiently  high. 

In  the  case  when  a  fault  is  too  small  to  be  detected  reliably,  the  net  effect 
of  the  fault  is  the  slight  increase  in  the  noise  level  of  the  faulty  processor 
output.  An  example  of  such  a  small  fault  is  when  the  least  significant  bit  of 
the  processor  output  is  stuck.  Such  a  small  fault  can  easily  escape  threshold 
detection,  and  would  appear  to  the  system  as  a  slight  increase  in  the  noise 
level  of  the  corresponding  processor  output.  If  the  small  undetected  fault 
were  in  the  checksum  processor,  it  would  not  have  any  effect  on  the  system 
application. 

3.2.3  Dynamic  Range  vs.  Numerical  Noise 

The  dynamic  range  of  the  checksum  processor  is  also  an  important  con¬ 
sideration  in  designing  a  single  fault  detection  system  in  the  presence  of 
numerical  noise.  One  must  prevent  the  overflow  in  the  checksum  proces¬ 
sors  and  in  the  input  checksum  and  the  syndrome  calculations.  There  is 
also  a  tradeoff  between  the  number  of  bits  used  in  the  checksum  processor 
registers  and  the  fault  detection  capability.  The  exact  nature  of  the  tradeoff 
depends  heavily  on  whether  the  fixed  point  or  floating  point  algorithm  is 
used  and  on  the  computational  algorithms  employed. 

Since the checksum processor input is the weighted sum of N data
processor inputs, the dynamic range of checksum processor k should be
$\omega_k$ times larger than that of the data processors in order to
prevent overflow, where $\omega_k$ is defined as

\[
\omega_k = \sum_{m=1}^{N} |w_{k,m}| \quad \text{for } k = N+1, \ldots, N+C \tag{3.3}
\]

When floating point arithmetic is used, the dynamic range is not generally
a big problem. However, when fixed point arithmetic is used, the checksum
processors should have $\log_2 \omega_k$ more bits in their registers than
the data processors (more bits should also be used in the input checksum
and syndrome calculations). If using more bits in the checksum processor
registers is not desired because of the added complexity, the weights
should be chosen so that $\omega_k \le 1$ in order to prevent overflow.
For example, in the single fault detection system with one checksum
processor, the weights $w_{N+1,k}$ can all be equal to 1/N, except for
$w_{N+1,N+1}$ which is equal to -1. However, with these weights, the
syndrome $s_{N+1} = \phi_{N+1} - \frac{1}{N} \sum_{m=1}^{N} \phi_m$ is
heavily dominated by the checksum processor error $\phi_{N+1}$. The
numerical noise from the checksum processor would be weighted O(N) times
higher than the noises from the data processors in the syndrome. This
means that the noise from the checksum processor will mask small faults
from the data processors in the threshold test. The data processor fault
has to be significant before it is detected. One can eliminate this
problem by making all the weights comparable in magnitude (for example,
$w_{N+1,k} = 1$ for k = 1, ..., N and $w_{N+1,N+1} = -1$), but this
requires adding O(log N) bits to the checksum processor registers.
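
The two weight choices discussed above can be compared with a short calculation of the dynamic range factor of equation 3.3 and the resulting number of extra register bits. This is a sketch only; the function names are hypothetical, N = 8 is an arbitrary example value, and rounding the bit count up to an integer is an assumption of this example.

```python
from fractions import Fraction
from math import ceil, log2


def dynamic_range_factor(data_weights):
    """omega_k of eq. (3.3): sum of |w_{k,m}| over the N data processors."""
    return sum(abs(w) for w in data_weights)


def extra_bits(data_weights):
    """Extra fixed point register bits: log2(omega_k), rounded up, at least 0."""
    omega = dynamic_range_factor(data_weights)
    return max(0, ceil(log2(omega)))


N = 8                                    # illustrative number of data processors
unit_weights = [1] * N                   # all data weights equal to 1
scaled_weights = [Fraction(1, N)] * N    # all data weights equal to 1/N

print(dynamic_range_factor(unit_weights), extra_bits(unit_weights))
print(dynamic_range_factor(scaled_weights), extra_bits(scaled_weights))
```

Unit weights need three extra bits for N = 8, while the 1/N weights need none, at the cost of the noise masking described above.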


3.3  Single  Fault  Correction 

3.3.1  The  Projection  Method 

In  the  single  fault  detection  system  with  one  checksum  processor,  we  have 
used  the  threshold  method  for  the  fault  diagnosis.  This  method  assumes 
that  small  syndrome  energy  is  likely  caused  by  the  numerical  error  and  large 
syndrome  energy  is  likely  caused  by  a  hardware  fault.  For  the  single  fault 
correction  system,  we  shall  use  a  similar  test  for  fault  detection  and  location. 
An  obvious  method  would  be  to  do  the  threshold  testing  on  each  of  the 
processors.  That  means  doing  the  threshold  test  on  the  syndrome  energy 
in  each  weight  vector  direction  in  the  syndrome  space.  Such  threshold  tests 
involve  projecting  the  syndrome  s  on  to  each  weight  vector  and  doing  the 
threshold test on the energy of the projection. If the syndrome noises are
white, we can use Euclidean projections, but if the syndrome noises are
correlated, we should compensate for the covariance of the syndromes when
finding the “projection” along the weight vector.

The numerical noise variance in the syndrome space can be calculated
as follows. We assume that the numerical noises have zero mean and certain
expected variances. When there is no fault in the system, the input
checksum error $\delta_k$, the processor noise $\epsilon_k$, and the
syndrome computation noise $\lambda_k$ are no longer equal to zero and are
modeled as zero-mean random variables. We shall define $\Delta_k$,
$\Gamma_k$, and $\Lambda_k$ to be the variances of $\delta_k$,
$\epsilon_k$, and $\lambda_k$. Let us define $\Phi_k$ as the variance of
the composite error $\phi_k$.

\[
\Phi_k = \begin{cases}
\Gamma_k & \text{for } k = 1, \ldots, N \\
\Gamma_k + F \Delta_k F^T + \Lambda_k & \text{for } k = N+1, \ldots, N+C
\end{cases} \tag{3.4}
\]

Let us define $\Phi$ to be the (N + C)q by (N + C)q block diagonal matrix
with the q by q diagonal blocks $\Phi_k$.

\[
\Phi = \left[\begin{array}{cccc}
\Phi_1 & 0 & \cdots & 0 \\
0 & \Phi_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \Phi_{N+C}
\end{array}\right] \tag{3.5}
\]

Then the variance V of the syndrome s is equal to

\[
V = \sum_{m=1}^{N+C} W_m \Phi_m W_m^T \tag{3.6}
\]

where V is a Cq by Cq block matrix.

Now we are ready to calculate the projection of the syndrome onto each
weight vector. Let us define a norm as

\[
\|v\|_{V^{-1}}^2 = v^T V^{-1} v \tag{3.7}
\]

Let us also define $\hat{\phi}_k$ as the processor error estimate found by
projecting the syndrome s onto the k-th weight vector with respect to the
$V^{-1}$ norm

\[
\hat{\phi}_k = \arg\min_{\phi_k} \| s + W_k \phi_k \|_{V^{-1}}^2 \tag{3.8}
\]

where $\phi_k$ is the possible processor error. We use the $V^{-1}$ norm
to compensate for any correlations in the syndrome outputs. Solving for
the minimum, we have

\[
\hat{\phi}_k = -\left[ W_k^T V^{-1} W_k \right]^{-1} W_k^T V^{-1} s
\quad \text{for } k = 1, \ldots, N+C \tag{3.9}
\]

The $\hat{\phi}_k$ can be thought of as the processor error that comes
closest, in a least squares sense, to “explaining” the syndrome s.

Let $H_0$ represent the no-fault hypothesis and $H_k$ represent the
hypothesis that processor k has failed. The most likely failure hypothesis
$\hat{H}_k$ can be determined by computing the following.

\[
\begin{aligned}
L_0 &= \|s\|_{V^{-1}}^2 \\
L_k &= \min_{\phi_k} \| s + W_k \phi_k \|_{V^{-1}}^2
\quad \text{for } k = 1, \ldots, N+C
\end{aligned} \tag{3.10}
\]

The $L_k$ is equal to the best guess of the computation noise under
hypothesis $H_k$. The $L_k$'s are called log likelihoods for reasons that
will become clear when we later discuss the generalized likelihood ratio
test, which gives a result identical to that of the projection method. The
most likely failure hypothesis $\hat{H}_k$ can be computed by

\[
\hat{H}_k \leftarrow \min_k \left( L_k + \gamma_k \right) \tag{3.11}
\]

where the $\gamma_k$ act as scalar thresholds. These thresholds are used
to compensate for the different failure rates of the processors (the
checksum processor failure rate would be higher than the data processor
failure rate, since it includes failures in the input checksum and
syndrome calculations). The $L_k$ can be rewritten as

\begin{align}
L_k &= \min_{\phi_k} \| s + W_k \phi_k \|_{V^{-1}}^2 \notag \\
    &= \left\| \left( I - W_k \left[ W_k^T V^{-1} W_k \right]^{-1} W_k^T V^{-1} \right) s \right\|_{V^{-1}}^2 \tag{3.12} \\
    &= \|s\|_{V^{-1}}^2 - \| W_k \hat{\phi}_k \|_{V^{-1}}^2 \tag{3.13}
\end{align}

Using this, the equations for the log likelihoods can be simplified by
defining the relative log likelihood $L'_k$ to be

\[
L'_k = L_0 - L_k + \gamma'_k \tag{3.14}
\]

where $\gamma'_k = \gamma_0 - \gamma_k$. If $L'_k$ is the largest, then
this indicates that failure hypothesis $H_k$ is most likely. The $L'_k$
can be rewritten as

\[
\begin{aligned}
L'_0 &= 0 \\
L'_k &= \| W_k \hat{\phi}_k \|_{V^{-1}}^2 + \gamma'_k
\quad \text{for } k = 1, \ldots, N+C
\end{aligned} \tag{3.15}
\]

Notice that this is equivalent to doing the energy threshold test on
the projection $W_k \hat{\phi}_k$ of the syndrome onto the k-th weight
vector.

Once the fault has been determined to be in the k-th data processor, its
correct output can be calculated by subtracting the projection
$\hat{\phi}_k$ from the faulty processor's erroneous output $y_k$.

\[
\hat{y}_k = y_k - \hat{\phi}_k \tag{3.16}
\]

The numerical noise level in the corrected output would be significantly
larger than in the other processors. This is because $\hat{\phi}_k$
contains not only the fault, but also the weighted sum of the numerical
noise from all other processors. Therefore, the noise in the corrected
output $\hat{y}_k$ is equal to the weighted sum of the numerical noise
from all other processors.
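
The projection method can be illustrated for the smallest case, q = 1. The sketch below is written under stated assumptions: the syndrome variance is taken as V = I (so `V_inv` is the identity), all thresholds $\gamma'_k$ are set to the same arbitrary value -1, and the weight vectors are those of a two-data-processor, two-checksum-processor system. The names `quad` and `diagnose` are hypothetical.

```python
def quad(a, M, b):
    """Return a^T M b for vectors a, b and a square matrix M (nested lists)."""
    return sum(a[i] * M[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))


def diagnose(s, columns, V_inv, gamma_prime):
    """Return (k, phi_hat) for the most likely hypothesis; k = 0 means no fault.

    columns[k-1] is the weight vector of processor k and gamma_prime[k-1]
    its (negative) relative threshold.
    """
    best_k, best_L, best_phi = 0, 0.0, 0.0           # L'_0 = 0
    for k, w in enumerate(columns, start=1):
        wVw = quad(w, V_inv, w)
        wVs = quad(w, V_inv, s)
        phi_hat = -wVs / wVw                         # eq. (3.9) with q = 1
        L_rel = wVs * wVs / wVw + gamma_prime[k - 1]  # eq. (3.15)
        if L_rel > best_L:
            best_k, best_L, best_phi = k, L_rel, phi_hat
    return best_k, best_phi


columns = [(1, 1), (1, -1), (-1, 0), (0, -1)]   # two data, two checksum processors
V_inv = [[1.0, 0.0], [0.0, 1.0]]                # assumed white syndrome noise
gamma_prime = [-1.0] * 4                        # arbitrary illustrative thresholds

# An error of 5 in processor 2 gives the syndrome s = -w_2 * 5.
k, phi_hat = diagnose([-5.0, 5.0], columns, V_inv, gamma_prime)
print(k, phi_hat)
```

Subtracting the returned `phi_hat` from the output of data processor `k` then gives the corrected output, as in (3.16). For q > 1 the scalars become q-vectors and the squared terms become energies, but the structure of the test is the same.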

The projection method for single fault location is equivalent to dividing
the Cq dimensional syndrome space into (N + C + 1) decision regions. Each
decision region belongs to a failure hypothesis. If we let $H_0$ represent
the no-fault hypothesis and $H_k$ represent the hypothesis that processor
k has failed, the decision region of hypothesis $H_0$ would be around the
origin, and the decision region of the hypothesis $H_k$ would be around
the k-th weight vector. For example, when two checksum processors are used
to check two data processors with weight matrix

\[
W = \left[\begin{array}{rrrr}
1 & 1 & -1 & 0 \\
1 & -1 & 0 & -1
\end{array}\right] \tag{3.17}
\]

the decision regions belonging to each failure hypothesis when q = 1 are
plotted in the syndrome space in figure 3.1.

Setting the thresholds for single fault correction is very similar to the
single fault detection case. For example, if we know the probability
distribution of the normalized projection energy
$\| W_k \hat{\phi}_k \|_{V^{-1}}^2$ under $H_0$, then we can set the
probability that $L'_k$ exceeds zero under $H_0$ by adjusting $\gamma'_k$.
The actual false alarm rate for processor k is lower than that, since one
has to account for the cases when the numerical noise causes more than one
$L'_k$ to be greater than zero. This happens because the weight vectors
are not orthogonal to each other.

As was the case in the single fault detection threshold method, the
system's ability to detect a small transient fault or a small permanent
fault depends very much on the output batch size q. The mean value of
the normalized projection energy $\| W_k \hat{\phi}_k \|_{V^{-1}}^2$ grows
as O(q), whereas its standard deviation grows as $O(\sqrt{q})$. Therefore,
with a larger q the system is more likely to detect a small permanent
failure affecting many of the output data points, but it is less likely to
detect a transient failure affecting few of the output data points.

3.3.2  Effects  of  Incorrect  Fault  Diagnosis 

The probability of making the correct fault diagnoses can be calculated by
evaluating the following integral

\[
\mathrm{Prob}(\text{Guess } \hat{H}_m \mid H_k \text{ is correct})
= \int_{H_m} p(s \mid H_k, \phi_k) \, ds \tag{3.18}
\]

which evaluates the probability that the syndrome under $H_k$ would fall
in the decision region $H_m$. This is a very difficult integral to
evaluate.
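
When the integral cannot be evaluated in closed form, one option is to estimate it by simulation. The following sketch is not from the original analysis; it assumes q = 1, independent Gaussian noise of standard deviation `sigma` in every processor, two data and two checksum processors, and a syndrome variance proportional to the identity so that the scale factor of $V^{-1}$ can be absorbed into a single threshold `gamma_prime`. All names are hypothetical.

```python
import random

COLUMNS = [(1, 1), (1, -1), (-1, 0), (0, -1)]   # weight vectors w_1 .. w_4


def diagnose(s, gamma_prime):
    """Projection test for q = 1 with V proportional to I; 0 means no fault."""
    best_k, best_L = 0, 0.0
    for k, w in enumerate(COLUMNS, start=1):
        ws = w[0] * s[0] + w[1] * s[1]
        ww = w[0] * w[0] + w[1] * w[1]
        L_rel = ws * ws / ww + gamma_prime       # projection energy plus threshold
        if L_rel > best_L:
            best_k, best_L = k, L_rel
    return best_k


def diagnosis_frequencies(faulty, fault_size, sigma, gamma_prime, trials, seed=0):
    """Estimate Prob(guess H_m | H_faulty) for m = 0 .. 4 by simulation."""
    rng = random.Random(seed)
    counts = [0] * (len(COLUMNS) + 1)
    for _ in range(trials):
        # Composite error of each processor: Gaussian noise plus the fault.
        phi = [rng.gauss(0.0, sigma) for _ in COLUMNS]
        if faulty:
            phi[faulty - 1] += fault_size
        # Syndrome convention s = -sum_k w_k * phi_k.
        s = [-sum(w[i] * e for w, e in zip(COLUMNS, phi)) for i in range(2)]
        counts[diagnose(s, gamma_prime)] += 1
    return [c / trials for c in counts]


large = diagnosis_frequencies(2, 100.0, 1.0, -25.0, 2000)
quiet = diagnosis_frequencies(0, 0.0, 1.0, -25.0, 2000)
print(large)
print(quiet)
```

Sweeping `fault_size` toward the noise level shows how the probability mass moves from the correct region into the no-fault region and the neighboring regions.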

In a single fault correction system with numerical noise, there are three
types of incorrect fault diagnosis that can be made. The first is the false
alarm case, in which large numerical noise in the system makes one or
more $L'_k$ exceed zero. The second type of false diagnosis is a small
fault that is not detected by the system. The third type of false
diagnosis happens when a small fault has occurred in one processor, but
the numerical noise has pushed the syndrome closer to another weight
vector, causing that processor to be diagnosed as faulty.

In the case of a false alarm, one of the processors would be incorrectly
diagnosed to be faulty. If it happens to be a data processor, its output is
corrected by subtracting $\hat{\phi}_k$. However, $\hat{\phi}_k$ under
$H_0$ is equal to the weighted sum of the numerical noise from all the
processors. Therefore, the corrected processor output now has its own
numerical noise subtracted out, but contains the weighted sum of the
numerical noise from all other processors. Furthermore, the numerical
noise level in the system is much higher than the norm, since the false
alarm is caused by large numerical noise. Therefore, the numerical noise
level in the corrected output is much higher than the noise level in the
normal processor output.

In the case when the fault is small and is not detected by the system,
the effect on the system is equivalent to a slight increase in the noise
level in the faulty processor. However, if the processor error energy
level is too small to be detected, it should not affect the system
performance greatly.

In  the  case  when  one  processor  has  a  small  fault,  but  the  numerical 
noise  pushes  the  syndrome  closer  to  another  weight  vector,  the  effect  is 
similar  to  increasing  the  noise  level  in  both  processors.  The  faulty  processor 
output  would  have  a  higher  noise  level  than  normal,  since  it  contains  a 
fault.  The  output  of  the  processor  that  is  falsely  diagnosed  to  be  faulty 
would  be  needlessly  corrected.  Therefore,  the  corrected  output  of  that 
processor  would  have  an  increased  noise  level,  similar  to  the  false  alarm 
case.  Again,  this  type  of  false  diagnosis  is  most  likely  only  when  the  fault  is 
small,  thus  the  system  performance  should  not  be  severely  degraded.  Notice 
this  increase  in  noise  level  only  applies  to  data  processors.  If  a  checksum 
processor  is  falsely  diagnosed  to  be  faulty,  its  output  is  not  corrected.  If  the 
small  non-detected  fault  were  in  a  checksum  processor,  it  does  not  affect 
the  data  processor  outputs. 

As  one  can  see,  the  system  is  likely  to  make  a  false  diagnosis  only  when 
the  fault  is  relatively  small,  and  the  net  effect  of  the  false  diagnosis  is  the 
increased  noise  level  in  some  of  the  data  processor  outputs.  One  can  also  re¬ 
duce  the  chances  of  misdiagnosis  by  choosing  the  weight  vectors  so  that  the 
angles  between  the  weight  vectors  are  as  large  as  possible  (i.e.  spread  them 
as  far  apart  from  each  other  as  possible).  When  the  angles  between  neigh¬ 
boring  weight  vectors  are  large,  the  decision  regions  associated  with  the 
weight vectors become “fatter,” and there is less chance that the syndrome
of a small fault would be pushed closer to another weight vector line by
the  numerical  noise.  The  larger  angles  between  neighboring  weight  vectors 
also  mean  that  they  are  more  “orthogonal”  to  each  other.  Therefore,  there 
is  less  cross-covariance  between  the  syndromes  and  less  chance  of  confusing 
a  failure  in  one  processor  with  a  failure  in  another  processor.  The  lower 
cross-covariance  also  lowers  the  false  alarm  rate  since  the  numerical  noise 
from  other  processors  would  not  be  weighted  as  heavily  in  the  threshold 
tests.  With  a  lower  false  alarm  rate  and  a  lower  misdiagnosis  rate,  one  can 
also  afford  to  reduce  the  thresholds  and  increase  the  detection  sensitivity 
of  the  small  faults. 
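
One simple way to compare candidate weight sets on this criterion is to compute the smallest angle between any two weight vector lines. The sketch below is an illustration only; the function names are hypothetical, and the two weight sets are the C = 2 examples with weights 0, ±1 and 0, ±1, ±2.

```python
from itertools import combinations
from math import acos, degrees, sqrt


def line_angle(u, v):
    """Angle in degrees between the lines spanned by u and v (0 to 90)."""
    dot = abs(sum(a * b for a, b in zip(u, v)))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return degrees(acos(min(1.0, dot / norm)))


def min_separation(columns):
    """Smallest angle between any pair of weight vector lines."""
    return min(line_angle(u, v) for u, v in combinations(columns, 2))


four = [(1, 1), (1, -1), (-1, 0), (0, -1)]
eight = [(1, 1), (1, -1), (1, 2), (1, -2), (2, 1), (2, -1), (-1, 0), (0, -1)]

print(min_separation(four))
print(min_separation(eight))
```

The four-vector set keeps its lines at least 45 degrees apart, while the eight-vector set brings some lines to within about 18.4 degrees, which illustrates the price paid in decision region width for protecting more data processors with the same two checksum processors.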

It is possible to make some tradeoffs between the chances of occurrence
of the different types of misdiagnosis by adjusting the $\gamma'_k$,
thereby adjusting the sizes of the decision regions. For example,
increasing the magnitude of one $\gamma'_k$ decreases the size of that
decision region and increases the sizes of all the neighboring decision
regions. This decreases the probability of a false alarm in that processor
but increases the probability of a false alarm in the processors of the
neighboring decision regions. It also increases the size of the $H_0$
region, thereby increasing the chance that a small fault in that processor
escapes detection. If we increase the magnitudes of all the $\gamma'_k$'s
by the same amount, the decision regions would still have the same shape,
but the size of the $H_0$ region would expand, causing fewer false alarms
but increasing the chance that a small fault may escape detection.

In deriving the projection method for the single fault correction system,
we have assumed that there is no numerical noise involved in calculating
the $L'_k$'s. However, in real systems, there will most likely be
additional computation noise in the calculation of the $L'_k$'s. This will
increase the probability of fault misdiagnosis by the system, and also
increase the numerical noise in the corrected processor output.

3.3.3  Dynamic  Range  vs.  Numerical  Noise 

The dynamic range of the checksum processor is an important consideration
in the single fault correction system. The dynamic range of checksum
processor k should be $\omega_k$ times larger than that of the data
processors, where $\omega_k$ is defined in equation 3.3. Again, when
floating point arithmetic is used, the dynamic range is not generally a
big problem. However, when fixed point arithmetic is used, the checksum
processors should have $\log_2 \omega_k$ more bits than the data
processors. Otherwise, the weights should be chosen so that
$\omega_k \le 1$. Following is an example of a single fault correction
weight matrix which meets such a condition.


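\[
W = \left[\begin{array}{*{13}{r}}
\tfrac{1}{8} & \tfrac{1}{8} & 0 & 0 & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & -1 & 0 & 0 \\
\tfrac{1}{8} & -\tfrac{1}{8} & \tfrac{1}{8} & \tfrac{1}{8} & 0 & 0 & \tfrac{1}{8} & \tfrac{1}{8} & -\tfrac{1}{8} & -\tfrac{1}{8} & 0 & -1 & 0 \\
0 & 0 & \tfrac{1}{8} & -\tfrac{1}{8} & \tfrac{1}{8} & -\tfrac{1}{8} & \tfrac{1}{8} & -\tfrac{1}{8} & \tfrac{1}{8} & -\tfrac{1}{8} & 0 & 0 & -1
\end{array}\right] \tag{3.19}
\]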


However, using weights on the order of $1/N$ effectively weights the numerical
noise from the checksum processors much more heavily than the noise from the
data processors. This means that the noise energy from the checksum
processors will heavily mask the noise energy from the data processors in the
threshold test, and thus the system would be less sensitive in detecting
small failures in the data processors.


3.4  Simple  Case 

3.4.1  Simple  Case  of  Projection  Method 

The  computation  of  the  projection  method  becomes  especially  simple  when 
the  variance  matrix  V  is  a  scaled  identity  matrix.  There  are  two  sets  of 
assumptions  under  which  V  is  white. 

The first set of assumptions is that there is no numerical error in the input
checksum calculations or in the syndrome calculations, and that the weight
matrix is of a specific form: the computation noise covariance of every
processor is the same scaled identity matrix, the input checksum noise
covariance and the syndrome calculation noise covariance are both zero, and

$$ WW^T = \sigma_W^2 I \qquad (3.20) $$

In this case, $V$ is equal to

$$ V = \sigma_V^2 I \qquad (3.21) $$

where $\sigma_V^2$ is proportional to the processor computation noise
variance. For example, when fixed point arithmetic is used with low integer
weights, the input checksumming noise and the syndrome calculation noise
would be equal to zero and this condition can be satisfied.

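The second set of assumptions is that all the computational noises are
white, $F$ is an orthogonal transform (an example of an orthogonal transform
is a Fourier Transform), and $W$ is of a specific form: the processor
computation noise, input checksum noise, and syndrome calculation noise
covariances are all scaled identity matrices, and

$$ WW^T = \sigma_W^2 I, \qquad FF^T = \sigma_F^2 I \qquad (3.22) $$

In this case, $V$ is equal to

$$ V = \sigma_V^2 I \qquad (3.23) $$

where $\sigma_V^2$ is a sum of terms contributed by the three noise variances.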


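If one of the above two conditions is satisfied, the projection estimate of
the fault, $\hat{\bar{\phi}}_k$, becomes

$$ \hat{\bar{\phi}}_k = -\frac{1}{r_{kk}} \sum_{m=N+1}^{N+C} w_{m,k}\, s_m \qquad (3.24) $$

where $r_{km}$ is equal to

$$ r_{km} = \sum_{l=N+1}^{N+C} w_{l,k}\, w_{l,m} \qquad (3.25) $$

The $r_{km}$ can be thought of as the cross-correlation between the $k$th and
$m$th weight vectors. When $k = m$, $r_{km}$ is the squared magnitude of the
weight vector. The relative likelihood can then be written as

$$ L'_k = \frac{1}{r_{kk}\sigma_V^2} \sum_{l=N+1}^{N+C} \sum_{m=N+1}^{N+C} w_{l,k}\, w_{m,k}\, s_l^T s_m + \gamma'_k \qquad (3.26) $$

One needs approximately $(N+C)C(C+1)q/2$ multiply/adds if all the
likelihoods are computed straightforwardly with this formula. The number
of multiply/adds can be minimized if the $L'_k$'s are computed in the
following way. Define $\rho_{l,m}$ as

$$ \rho_{l,m} = s_l^T s_m \quad \text{for } l = N+1,\ldots,N+C, \;\; m = N+1,\ldots,N+C \qquad (3.27) $$

Then $L'_k$ for $k = 1,\ldots,N$ can be calculated as

$$ L'_k = \frac{1}{r_{kk}\sigma_V^2}
\begin{pmatrix} w_{N+1,k} & \cdots & w_{N+C,k} \end{pmatrix}
\begin{pmatrix}
\rho_{N+1,N+1} & \cdots & \rho_{N+1,N+C} \\
\rho_{N+2,N+1} & \cdots & \rho_{N+2,N+C} \\
\vdots & & \vdots \\
\rho_{N+C,N+1} & \cdots & \rho_{N+C,N+C}
\end{pmatrix}
\begin{pmatrix} w_{N+1,k} \\ w_{N+2,k} \\ \vdots \\ w_{N+C,k} \end{pmatrix}
+ \gamma'_k \qquad (3.28) $$

and $L'_k$ for $k = N+1,\ldots,N+C$ can be calculated as

$$ L'_k = \frac{1}{r_{kk}\sigma_V^2}\,\rho_{k,k} + \gamma'_k \qquad (3.29) $$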


One needs approximately $C(C+1)q/2$ multiply/adds to compute the
$\rho_{l,m}$'s and about $N(C+1)C/2$ multiply/adds to compute the $L'_k$'s.
This is substantially less than the direct calculation method. If one of the
data processors is faulty, the correct value is calculated as

$$ \hat{y}_k = y_k + \frac{1}{r_{kk}} \sum_{m=N+1}^{N+C} w_{m,k}\, s_m \qquad (3.30) $$

One needs approximately $Cq$ multiply/adds to correct the faulty processor
output.
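As a minimal sketch of the white-noise procedure above (syndrome correlations, relative likelihoods, selection of the largest likelihood, and correction), the following Python fragment may help. It assumes $V = \sigma_V^2 I$ and the sign convention $s_m = -\sum_k w_{m,k} y_k$, under which an error $e$ in processor $k$ produces $s_m = -w_{m,k} e$. The function names, the threshold values `gamma`, and the numerical data are hypothetical and chosen only for illustration; the weight matrix is the small two-data, two-checksum example of equation 3.38.

```python
def syndrome_correlations(s):
    """rho[l][m] = s_l^T s_m for every pair of syndrome vectors (equation 3.27)."""
    return [[sum(a * b for a, b in zip(s_l, s_m)) for s_m in s] for s_l in s]


def relative_likelihoods(W, s, sigma_v2, gamma):
    """Return [L'_0, L'_1, ..., L'_{N+C}] assuming V = sigma_v2 * I.

    W        : C rows of N+C weights (row = checksum index, column = processor)
    s        : C syndrome vectors, each of length q
    gamma    : N+C threshold constants gamma'_k (hypothetical values)
    """
    C = len(W)
    rho = syndrome_correlations(s)
    likelihoods = [0.0]  # L'_0 = 0 for the no-fault hypothesis
    for k in range(len(W[0])):
        w = [W[l][k] for l in range(C)]   # weight vector of processor k+1
        r_kk = sum(x * x for x in w)      # squared magnitude (equation 3.25)
        energy = sum(w[l] * rho[l][m] * w[m] for l in range(C) for m in range(C))
        likelihoods.append(energy / (r_kk * sigma_v2) + gamma[k])
    return likelihoods


def correct_output(W, s, y_k, k):
    """Equation 3.30: corrected output of data processor k (1-based index)."""
    C = len(W)
    w = [W[l][k - 1] for l in range(C)]
    r_kk = sum(x * x for x in w)
    return [y + sum(w[l] * s[l][i] for l in range(C)) / r_kk
            for i, y in enumerate(y_k)]


# Weight matrix of equation 3.38 (N = 2 data processors, C = 2 checksum processors).
W = [[1, 1, -1, 0],
     [1, -1, 0, -1]]
y1, y2 = [1.0, 2.0], [8.0, -1.0]    # processor 2 is faulty; it should output [3, 4]
y3, y4 = [4.0, 6.0], [-2.0, -2.0]   # fault-free checksum outputs (y1 + y2, y1 - y2)
outputs = [y1, y2, y3, y4]

# Assumed sign convention: s_m = -sum_k w[m][k] * y_k
s = [[-sum(W[m][k] * outputs[k][i] for k in range(4)) for i in range(2)]
     for m in range(2)]

L = relative_likelihoods(W, s, sigma_v2=1.0, gamma=[-10.0] * 4)
k_hat = max(range(len(L)), key=lambda k: L[k])   # 0 would mean "no fault"
y2_corrected = correct_output(W, s, y2, k_hat)   # valid here because k_hat is a data processor
```

The correlations are computed once and reused for every hypothesis, which is where the saving over the direct double sum comes from.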

Therefore, in order to detect and correct the fault in the simple white
noise case, it takes approximately $(C+1)C(q+N)/2 + Cq$ multiply/adds. It
also takes approximately $CNp$ multiply/adds to compute the input checksums
and $CNq$ multiply/adds to compute the syndromes. Let us assume that the
hardware module that detects and corrects the fault from the syndromes is
triplicated, so that its failure rate does not dominate the system failure
rate. In that case, the total number of multiply/adds needed for the single
fault correction is equal to

$$ CT + CN(p+q) + 3\left[(C+1)C(q+N)/2 + Cq\right] \qquad (3.31) $$

where $T$ is the number of multiply/adds needed to compute $F$. The first
term is the computation performed by the checksum processors, the second
term reflects the input checksum and syndrome computations, and the third
term is the work required by the triplicated fault detection/correction
algorithm. We define the computation overhead ratio $R_c$ to be the ratio
between the computations needed for fault tolerance overhead and the
computations needed for the data processors. Therefore, we divide the
overhead computation by $NT$ to get $R_c$:

$$ R_c = \frac{C}{N} + \frac{C(p+q)}{T} + \frac{3\left[(C+1)C(q+N)/2 + Cq\right]}{NT} \qquad (3.32) $$

In general, $R_c$ can be made lower by decreasing the number of checksum
processors $C$ and increasing the number of data processors $N$. $R_c$ is
also lower when $T/q$ is higher: when the computational effort involved in
computing each output data point is high, the fault detection/correction
effort becomes relatively smaller.
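The overhead ratio of equation 3.32 is easy to evaluate numerically. The sketch below is one possible transcription; the function name `overhead_ratio` and the sample parameter values are hypothetical and only illustrate how the ratio approaches $C/N$ as $T$ grows.

```python
def overhead_ratio(N, C, p, q, T):
    """Computation overhead ratio R_c of equation 3.32.

    N, C : numbers of data and checksum processors
    p, q : input and output batch lengths
    T    : multiply/adds needed to compute F once
    """
    checksum_processors = C / N
    checksums_and_syndromes = C * (p + q) / T
    triplicated_detection = 3 * ((C + 1) * C * (q + N) / 2 + C * q) / (N * T)
    return checksum_processors + checksums_and_syndromes + triplicated_detection


# Hypothetical example: N = 10, C = 3, p = q = 64, for two task sizes T.
small_task = overhead_ratio(10, 3, 64, 64, 5000.0)
large_task = overhead_ratio(10, 3, 64, 64, 500000.0)
```

With these sample values the ratio for the larger task is much closer to the floor of $C/N = 0.3$ set by the checksum processors themselves.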


3.4.2  Computational  Overhead  Example 

We can further reduce the computation by choosing the weights to be small
integers. When small integers are used as weights, the number of multiplies
needed is drastically reduced in the input checksum and syndrome
computations as well as in the fault detection and correction procedure.
Multiplication by weights such as 0, -1, +1, -2, +2, -1/2, +1/2 is very
simple, requiring much less computation than a full multiply. Let us consider
the system with the weight matrix in equation 2.31, with ten data processors
and three checksum processors ($N = 10$, $C = 3$). Since only -1, 0, and +1
are used as the weights, there is no multiplication involved in the input
checksum and syndrome calculations. These weight vectors are spread
relatively far apart from each other, minimizing the chances of incorrect
fault diagnosis. Furthermore, the weight matrix satisfies $WW^T = 9I$, one of
the conditions required to achieve uncorrelated syndromes and a simple
relative likelihood test. However, this system does use one more checksum
processor than the minimum number required for single fault correction.




Operation                              Multiplies    Additions

Input Checksums                        0             21q
Syndromes                              0             24q
Syndrome Correlations $\rho_{l,m}$     6q            6(q - 1)
Relative Likelihoods                   4             35
Choose largest likelihood              0             12
Correction of Failed Processor         q             3q

Table 3.1: Number of Multiplies and Adds


If we assume that all the numerical noises are white and $V = \sigma_V^2 I$,
the numbers of adds and multiplies needed for the input checksum and syndrome
calculations and for the fault detection/correction algorithm are listed in
table 3.1. We have not counted multiplication by 1/2 or -1/2 (there are only
6 such operations). We have also used a recursive summation algorithm to
compute the $L'_k$'s, which cuts down on the number of additions. If we assume
that the computation involved in the fault detection/correction algorithm is
triplicated, the computational overhead ratio $R_c$ for the multiplies becomes

$$ R_c = 0.3 + \frac{3(7q + 4)}{10T} \qquad (3.33) $$

and for the additions becomes

$$ R_c = 0.3 + \frac{45q + 3(9q + 41)}{10T} \qquad (3.34) $$


The  factor  3  is  there  because  the  fault  detection/correction  computation 
must  be  triplicated. 

Let us pick a complex FFT as the task $F$. The FFT is a very realistic
example of a task that a high performance DSP system has to perform in real
time. If we assume that all the numerical noises are white, then $V$ is
diagonal since the FFT is an orthogonal transform ($FF^T = I$), and since
$WW^T = \sigma_W^2 I$. The complex FFT of length $q/2$ consists of $p = q$
input and output real data points. Computing $F$ requires $2q\log_2(q/2)$
real multiplies and $3q\log_2(q) - q$ real additions. Therefore, the
computational overhead ratio for the multiplies is equal to

$$ R_c = 0.3 + \frac{21q + 12}{20q\log_2(q/2)} \qquad (3.35) $$

and for the additions

$$ R_c = 0.3 + \frac{72q + 123}{30q\log_2 q - 10q} \qquad (3.36) $$

These overhead ratio curves are plotted in figure 3.2 with the “ideal”
overhead ratio accounting only for the checksum processor computations.

Figure 3.2: Overhead Ratio for Complex FFT (N=10, C=3). Curves: Multiplies, Additions, Ideal; horizontal axis: $q$.
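To get a feel for how quickly the two FFT overhead ratios approach the ideal value of 0.3, equations 3.35 and 3.36 can be evaluated for a few batch sizes. This is only an illustrative sketch; the function names and the chosen values of $q$ are hypothetical.

```python
import math


def fft_overhead_multiplies(q):
    """Equation 3.35: overhead ratio for real multiplies (N = 10, C = 3), valid for q > 2."""
    return 0.3 + (21 * q + 12) / (20 * q * math.log2(q / 2))


def fft_overhead_additions(q):
    """Equation 3.36: overhead ratio for real additions (N = 10, C = 3), valid for q > 2."""
    return 0.3 + (72 * q + 123) / (30 * q * math.log2(q) - 10 * q)


# Tabulate both ratios for batch sizes q = 8, 16, ..., 4096.
batch_sizes = [2 ** n for n in range(3, 13)]
table = [(q, fft_overhead_multiplies(q), fft_overhead_additions(q)) for q in batch_sizes]
for q, r_mult, r_add in table:
    print(f"q = {q:5d}   multiplies: {r_mult:.3f}   additions: {r_add:.3f}")
```

Both ratios decrease as $q$ grows, because the FFT cost per output point grows like $\log_2 q$ while the fault tolerance overhead per point stays roughly constant.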


3.5  Other Single Fault Correction Methods

3.5.1  Using  Range  Test  on  The  Syndromes 

There is another easy method for single fault correction when only -1,
0, and +1 are used as weights. It involves performing a range test on
the individual syndromes, component by component. Fault diagnosis is
carried out component by component as if $q = 1$. Each component of each
syndrome is classified into one of three ranges of values, and the code digit
$b_m$ is assigned to each syndrome depending on its range.

$$ b_m = \begin{cases} +1 & \text{if } s_m < -\gamma \\ 0 & \text{if } -\gamma \le s_m \le +\gamma \\ -1 & \text{if } +\gamma < s_m \end{cases} \qquad (3.37) $$

Here $\gamma$ is a scalar threshold. If there is no fault, all the $b_m$ would
be zero. If there is a fault, the faulty processor is the one with the weight
vector $\pm(b_{N+1}, b_{N+2}, \ldots, b_{N+C})$. Figure 3.3 shows the decision
regions used when there are two data processors and two checksum processors
with the weight matrix

$$ W = \begin{pmatrix} 1 & 1 & -1 & 0 \\ 1 & -1 & 0 & -1 \end{pmatrix} \qquad (3.38) $$


This  range  test  method  involves  2Cq  comparison  tests  for  each  batch  of 
data  processing.  The  major  advantage  of  this  method  is  that  it  is  very 
simple  to  implement  and  requires  little  computation  with  no  multiplies. 
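A minimal sketch of the range test for the $q = 1$ case is given below, using the weight matrix of equation 3.38. The function name `range_test_diagnose`, its return convention (0 for no fault, `None` for an unmatched code), and the sample syndrome values are hypothetical.

```python
def range_test_diagnose(W, s, gamma):
    """Classify each scalar syndrome into a code digit and match the code to a weight vector.

    Returns 0 if no fault is detected, the 1-based index of the processor whose
    weight vector equals +b or -b, or None if no weight vector matches.
    """
    b = []
    for s_m in s:
        if s_m < -gamma:
            b.append(1)
        elif s_m > gamma:
            b.append(-1)
        else:
            b.append(0)
    if not any(b):
        return 0
    negated = [-digit for digit in b]
    for k in range(len(W[0])):
        column = [row[k] for row in W]   # weight vector of processor k+1
        if column == b or column == negated:
            return k + 1
    return None


# Weight matrix of equation 3.38: processors 1 and 2 are data processors,
# processors 3 and 4 are the checksum processors N+1 and N+2.
W = [[1, 1, -1, 0],
     [1, -1, 0, -1]]

diagnosis = range_test_diagnose(W, [-5.0, 5.0], gamma=1.0)   # code (+1, -1) matches processor 2
```

Because the match is made against both $+b$ and $-b$, the result does not depend on the sign of the fault, and the test needs only comparisons.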

The probability that a syndrome magnitude exceeds the threshold $\gamma$ can
be determined by integrating the probability distribution of the syndrome
over the ranges $s_k < -\gamma$ and $s_k > \gamma$. Therefore, the false alarm
rate can be set by adjusting $\gamma$ as well. Notice that if $\gamma$ is not
set large enough, it is possible to misdiagnose a checksum processor fault
even if the fault is very large. This is because the width of the checksum
processor decision regions does not increase with the fault size. In figure
3.3, one can see that the decision regions for the failure hypotheses
$H_{N+1}$ and $H_{N+2}$ have constant widths. On the other hand, the
projection method decision regions are all pie-shaped (getting wider as the
fault size increases) as shown in figure 3.1, so that the probability of
fault misdiagnosis decreases as the fault size increases. Consider the case
when checksum processor $N+1$ has a large fault, but the numerical error
pushes the syndrome point from $p_1$ to $p_2$ as illustrated in figure 3.3.
This causes the system to diagnose processor 1 as being faulty. If $s_{N+1}$
were used to correct the fault ($\hat{y}_1 = y_1 + s_{N+1}/w_{N+1,1}$), the
corrected output $\hat{y}_1$ would be very wrong.

One way to get around this problem is to correct the data processor
fault using the smallest syndrome with non-zero weight.

$$ \hat{y}_k = y_k + \frac{\min(s_m)}{w_{m,k}} \quad \text{for } w_{m,k} \ne 0 \qquad (3.39) $$

Another way to get around the problem is to increase the threshold $\gamma$
and widen the checksum processor decision regions. Wider decision regions
would lower the probability of misdiagnosis for a checksum processor fault.
However, a larger $\gamma$ would make it more difficult to detect a small
processor fault. This could be a problem since detecting a small fault is
difficult in the first place, due to the fact that the fault detection is
done on a point by point basis (as if $q = 1$).
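The first workaround, correcting with the smallest syndrome that has a non-zero weight, can be sketched as follows for the $q = 1$ case. The sketch assumes that "smallest" means smallest in magnitude, which is an interpretation rather than something stated explicitly; the function name and the numerical values are hypothetical.

```python
def correct_with_smallest_syndrome(W, s, y_k, k):
    """Correct data processor k (1-based) using the smallest-magnitude syndrome
    among those with a non-zero weight w[m][k] (an interpretation of equation 3.39)."""
    candidates = [m for m in range(len(W)) if W[m][k - 1] != 0]
    m_best = min(candidates, key=lambda m: abs(s[m]))
    return y_k + s[m_best] / W[m_best][k - 1]


# Weight matrix of equation 3.38.
W = [[1, 1, -1, 0],
     [1, -1, 0, -1]]

# Scenario from the text: checksum processor N+1 has a large fault (s_{N+1} = 100),
# and numerical error pushes s_{N+2} just past the threshold, so processor 1 is
# wrongly diagnosed. Correcting with s_{N+1} would change y_1 by 100; the
# smallest syndrome changes it only by 1.2.
y1_after_misdiagnosis = correct_with_smallest_syndrome(W, [100.0, 1.2], 7.0, 1)
```

In this scenario the damage done by the misdiagnosis is bounded by a quantity on the order of the threshold rather than by the size of the checksum processor fault.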

3.5.2  Error  Coding  Method 

If only 0 and 1 were used as weights with the range method discussed
above, the method is equivalent to using an error coding technique for single
fault correction. In this case, it is possible to use $q > 1$. The syndrome
energies $s_m^T s_m$ are classified into two categories and the code digit
$b_m$ is assigned to each syndrome depending on the range.

$$ b_m = \begin{cases} 1 & \text{if } s_m^T s_m > \gamma \\ 0 & \text{if } s_m^T s_m \le \gamma \end{cases} \qquad (3.40) $$

If there is no fault, all the $b_m$ would be zero. If there is a fault, the
faulty processor is the one with the weight vector
$(b_{N+1}, b_{N+2}, \ldots, b_{N+C})$. The major disadvantage of this method
is that it requires more checksum processors than other methods. Another
problem is that once again the decision regions for the checksum processors
are narrow even when the fault is large. Therefore, one has to either avoid
using the maximum syndrome for fault correction, or increase the threshold
$\gamma$ to reduce the probability of misdiagnosis for a checksum processor
fault, as was the case in the previous range test method.
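The error coding variant can be sketched in the same style. The weight matrix below is a hypothetical example (four data processors and three checksum processors, with data weights 0 or 1 and each data column containing at least two ones so that it cannot be confused with a checksum column); it is not taken from the text. The function name and syndrome values are likewise illustrative.

```python
def error_code_diagnose(W, s, gamma):
    """Threshold each syndrome energy s_m^T s_m and match the binary code
    to the pattern of non-zero weights of one processor.

    Returns 0 for no fault, the 1-based processor index, or None if unmatched.
    """
    b = [1 if sum(x * x for x in s_m) > gamma else 0 for s_m in s]
    if not any(b):
        return 0
    for k in range(len(W[0])):
        if [abs(row[k]) for row in W] == b:
            return k + 1
    return None


# Hypothetical weight matrix: columns 1-4 are data processors, columns 5-7 are
# the checksum processors (weight -1 on their own syndrome).
W = [[1, 1, 0, 1, -1, 0, 0],
     [1, 0, 1, 1, 0, -1, 0],
     [0, 1, 1, 1, 0, 0, -1]]

# Syndromes with q = 2: large energy in the first and third, so the code is (1, 0, 1).
faulty = error_code_diagnose(W, [[3.0, -4.0], [0.1, 0.0], [3.0, -4.0]], gamma=1.0)
```

With three checksum processors only seven distinct non-zero codes exist, so at most four data processors can be protected in this example, which illustrates why the method needs more checksum processors than the other schemes.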

3.5.3  Angular  Method 

Another simple method for single fault correction is to use the slopes
between the syndromes for the fault location. This method also requires that
the fault diagnosis is done point by point (i.e. as if $q = 1$). When there
are only two checksum processors, the $(N+2)$ evenly spaced weight vectors
are chosen as

$$ w_{N+1,m} = \cos\frac{m\pi}{N+2}, \qquad w_{N+2,m} = \sin\frac{m\pi}{N+2} \qquad (3.41) $$

where $N$ is an even number. The weight vectors with $m = 0$ and
$m = (N+2)/2$ are the checksum processor weight vectors. The fault is
detected if the energy of the syndromes exceeds the threshold. Once a
fault is detected, the faulty processor number is equal to

$$ m_f = \operatorname{int}\left(\frac{N+2}{\pi}\arctan\frac{s_{N+2}}{s_{N+1}}\right) \qquad (3.42) $$

where the function int() rounds to the nearest integer. The decision regions
of this method are drawn in figure 3.4. In an actual implementation, one may
want to do a range test on the slope $s_{N+2}/s_{N+1}$ instead of computing
an explicit arctan. This would require $\log_2(N+C)$ threshold tests on the
slope. Since computing the slope requires a division, which is
computationally costly, the slope threshold test may be carried out by
multiplying $s_{N+1}$ by the test slope and comparing with $s_{N+2}$. In
systems with more than 2 checksum processors, the angles between $(C-1)$
syndrome pairs have to be calculated.
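The angular method can be sketched as follows, under the assumption that the weight vector of index $m$ points in the direction $(\cos(m\pi/(N+2)), \sin(m\pi/(N+2)))$ for $m = 0, \ldots, N+1$; this layout is an assumption made for the sketch, and the function name and sample values are hypothetical. The two-argument arctangent is used in place of a plain arctan of the slope so that $s_{N+1} = 0$ needs no special handling.

```python
import math


def angular_fault_index(s1, s2, n_vectors):
    """Locate the faulty weight-vector index m from the syndrome pair (s1, s2).

    n_vectors is N + 2. The angle is reduced modulo pi because a fault of
    either sign lies on the same line through the origin.
    """
    angle = math.atan2(s2, s1) % math.pi
    return round(angle * n_vectors / math.pi) % n_vectors


# Hypothetical example with N = 6 data processors, i.e. 8 evenly spaced weight vectors.
n_vectors = 8
theta = 3 * math.pi / n_vectors
fault = -2.5                                    # fault size in the processor with m = 3
s1, s2 = fault * math.cos(theta), fault * math.sin(theta)
m_f = angular_fault_index(s1, s2, n_vectors)
```

The final reduction modulo $N + 2$ maps a small negative angle (for example, numerical noise around $m = 0$) back to index 0 instead of $N + 2$.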
Chapter  4

Generalized Likelihood Ratio Test

4.1  Single  Fault  Correction 

Musicus and Song [Musicus 88, Song 87] have proven that when the numerical
noises are of Gaussian type, the projection method we have used for single
fault correction is equivalent to a generalized likelihood ratio test which
gives an “optimum” or “near optimum” performance. The detailed proof in
their paper is given in appendix B, and the following is a summary of their
derivation.

Assuming that processor failures occur independently of each other, with
$P_k$ equal to the probability of failure of processor $k$, the probability
of failure hypothesis $H_k$ is equal to

$$ p(H_0) = \prod_{m=1}^{N+C} (1 - P_m) $$

$$ p(H_k) = P_k \prod_{\substack{m=1 \\ m \ne k}}^{N+C} (1 - P_m) \qquad (4.1) $$

Note that the probabilities of the hypotheses that are considered do not
add exactly to 1 ($\sum_k p(H_k) < 1$). This is because only the no-fault and
single fault cases are considered. The assumption here is that $P_k \ll 1$
and that the probability of multiple faults occurring is negligible compared
to the probability of a single fault occurring. The $P_k$ of the checksum
processors include the probability of failure of the input checksum
calculation and the syndrome calculation.
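The hypothesis probabilities are simple to compute, and doing so shows how close their sum is to 1 when the failure probabilities are small. The function name and the sample failure probabilities below are hypothetical.

```python
def hypothesis_priors(P):
    """Return [p(H_0), p(H_1), ..., p(H_{N+C})] for independent processor failures.

    P[k-1] is the failure probability P_k of processor k.
    """
    def product_of_survival(skip=None):
        result = 1.0
        for m, p_m in enumerate(P):
            if m != skip:
                result *= 1.0 - p_m
        return result

    priors = [product_of_survival()]              # no processor has failed
    for k, p_k in enumerate(P):
        priors.append(p_k * product_of_survival(skip=k))   # only processor k+1 has failed
    return priors


# Hypothetical example: four processors, each failing with probability 0.01.
priors = hypothesis_priors([0.01] * 4)
missing_mass = 1.0 - sum(priors)   # probability of two or more simultaneous faults
```

The missing probability mass is exactly the probability of multiple simultaneous faults, which the single fault model neglects.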

Under $H_0$, the syndromes $s$ are modeled as zero-mean Gaussian random
variables with variance $V$

$$ p(s \mid H_0) = N(0, V) \qquad (4.2) $$

where

$$ V = W \Phi W^T = \sum_{m=1}^{N+C} W_m \Phi_m W_m^T \qquad (4.3) $$

as defined before.

In the presence of a processor fault, the model is modified as follows.
Under $H_k$, $\phi_k$ is modeled as a Gaussian random variable with an
unknown mean $\bar{\phi}_k$ and known covariance $\bar{\Phi}_k$. The mean
$\bar{\phi}_k$ is to be estimated from the syndrome. This estimation is
necessary because we do not have the probability distribution of the fault
available. With this model for $\phi_k$ under $H_k$, the probability
distribution of the syndrome $s$ under $H_k$ is

$$ p(s \mid H_k, \bar{\phi}_k) = N\Big(-W_k \bar{\phi}_k,\; \sum_{\substack{m=1 \\ m \ne k}}^{N+C} W_m \Phi_m W_m^T + W_k \bar{\Phi}_k W_k^T\Big) = N\big(-W_k \bar{\phi}_k,\; V - W_k \Delta\Phi_k W_k^T\big) \qquad (4.4) $$

where $\Delta\Phi_k = \Phi_k - \bar{\Phi}_k$ is the difference between the
processor's working covariance and the failure covariance.

A Generalized Likelihood Ratio Test (GLRT) is then used to find the
most likely failure hypothesis which would explain the values of the
syndrome $s$. Let $L_k$ be the log likelihood of $s$ and hypothesis $H_k$.

$$ L_0 = \log p(s, H_0) $$

$$ L_k = \max_{\bar{\phi}_k} \log p(s, H_k \mid \bar{\phi}_k) \quad \text{for } k = 1, \ldots, N+C \qquad (4.5) $$

Notice that the $L_k$ for $k = 1, \ldots, N+C$ must be maximized over the
failure size $\bar{\phi}_k$. This is because the actual processor error is
not known and must be estimated from the syndrome values. Let
$\hat{\bar{\phi}}_k$ be the value that maximizes the $k$th likelihood. The
$\hat{\bar{\phi}}_k$ can also be thought of as the most likely value of the
processor error that would have caused the syndrome values to be as they
are, if processor $k$ were faulty.

The $\hat{\bar{\phi}}_k$ can be found by using Bayes' Rule, substituting
Gaussian densities into the likelihood formulas, and then solving for the
maximum $L_k$ with respect to $\bar{\phi}_k$. The resulting
$\hat{\bar{\phi}}_k$ is exactly the same as the projection of the syndrome
$s$ onto the $k$th weight vector direction with respect to the $V^{-1}$
norm, as we found in the projection method.

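$$ \hat{\bar{\phi}}_k \leftarrow \min_{\bar{\phi}_k} \| s + W_k \bar{\phi}_k \|^2_{V^{-1}} \qquad (4.6) $$

or

$$ \hat{\bar{\phi}}_k = -\left[ W_k^T V^{-1} W_k \right]^{-1} W_k^T V^{-1} s \quad \text{for } k = 1, \ldots, N+C \qquad (4.7) $$

Substituting this into $L_k$,

$$ L_0 = -\tfrac{1}{2} \| s \|^2_{V^{-1}} + \gamma_0 $$

$$ L_k = -\tfrac{1}{2} \| s + W_k \hat{\bar{\phi}}_k \|^2_{V^{-1}} + \gamma_k \quad \text{for } k = 1, \ldots, N+C \qquad (4.8) $$

where the constants $\gamma_k$ do not depend on $s$. If we define the
relative likelihood $L'_k$ (relative to $H_0$) as

$$ L'_k = 2(L_k - L_0) \qquad (4.9) $$

then we have

$$ L'_0 = 0 $$

$$ L'_k = \| W_k \hat{\bar{\phi}}_k \|^2_{V^{-1}} + \gamma'_k \quad \text{for } k = 1, \ldots, N+C \qquad (4.10) $$

where

$$ \gamma'_k = -\log\left| I - W_k^T V^{-1} W_k \Delta\Phi_k \right| + 2\log\left(\frac{P_k}{1 - P_k}\right) \qquad (4.11) $$

These formulas are exactly what we derived in the projection method section,
except for the fact that the likelihood ratio test arrives at certain fixed
constants for the thresholds.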

Although the generalized likelihood ratio test is designed to give the
most likely failure hypothesis for the syndromes with these constants, it is
not clear that these constants are appropriate. The reason is that the
processor failure model was not known in advance and thus the failure size
had to be jointly estimated along with the hypothesis from the syndromes.
This is more of an ad hoc likelihood ratio test in which some of the
probabilities are known in advance, but the parameters have to be guessed.
Therefore, although the form of the likelihood equations is desirable, their
constants may not be the most useful. The fact that the GLRT does not give
the exact constants is not very relevant in application, since we would like
to exploit the selection of constants to achieve the desired tradeoff between
various probabilities of misdiagnosis.


4.2  Reducing  The  Numerical  Noise 

So far, we have used syndromes exclusively for fault detection/correction
purposes. However, in some cases, it is possible to also use the syndromes
for reducing the numerical noise in the outputs of the data processors. The
derivation of the method involves a modified generalized likelihood ratio
test derived by Musicus and Song [Musicus 88, Song 87]. The detailed proof
in their paper is given in appendix C, and the following is a summary of
their derivation. In the modified generalized likelihood ratio test, the
outputs of the data processors as well as the syndromes are used in order
not only to detect and correct the faulty processor, but also to estimate the
correct output of the working data processors and thus possibly reduce the
noise level in the outputs.

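Let us define $\bar{y}_k$ as what the processor output should be.

$$ \bar{y}_k = F x_k \qquad (4.12) $$

Let us also define $y$ as the block vector of the actual processor outputs
and $\bar{y}$ as the block vector of what the correct outputs should be.
These length-$Nq$ block vectors are

$$ y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad \bar{y} = \begin{pmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_N \end{pmatrix} \qquad (4.13) $$

The log likelihoods $L_k$ in the modified generalized likelihood ratio test
are given as

$$ L_0 = \max_{\bar{y}} \log p(y, s, H_0 \mid \bar{y}) \qquad (4.14) $$

$$ L_k = \max_{\bar{y}} \max_{\bar{\phi}_k} \log p(y, s, H_k \mid \bar{y}, \bar{\phi}_k) \quad \text{for } k = 1, \ldots, N+C \qquad (4.15) $$

The $L_k$'s are maximized over the correct output $\bar{y}$ and the processor
error $\bar{\phi}_k$. Let $\hat{y}$ and $\hat{\bar{\phi}}_k$ be the values of
$\bar{y}$ and $\bar{\phi}_k$ that maximize $L_k$. The $\hat{y}$ and
$\hat{\bar{\phi}}_k$ can also be thought of as the most likely values of the
correct outputs and the processor error that would cause the observed
syndromes and processor outputs under the failure hypothesis $H_k$. The
$\hat{y}$ and $\hat{\bar{\phi}}_k$ are again calculated by using Bayes' rule,
substituting Gaussian densities, and solving for the maximum. The most
likely $\hat{\bar{\phi}}_k$ is given by

$$ \hat{\bar{\phi}}_k \leftarrow \min_{\bar{\phi}_k} \| s + W_k \bar{\phi}_k \|^2_{V^{-1}} \qquad (4.16) $$

or

$$ \hat{\bar{\phi}}_k = -\left[ W_k^T V^{-1} W_k \right]^{-1} W_k^T V^{-1} s \quad \text{for } k = 1, \ldots, N+C \qquad (4.17) $$

exactly as in the previous likelihood ratio test. Under the hypothesis $H_0$,
$\hat{y}$ is equal to

$$ \hat{y}_m = y_m + \Phi_m W_m^T V^{-1} s \quad \text{for } m = 1, \ldots, N \qquad (4.18) $$

and under the hypothesis $H_k$ for $k = 1, \ldots, N+C$, $\hat{y}$ is equal to

$$ \hat{y}_m = y_m + \Phi_m W_m^T V^{-1} \big(s + W_k \hat{\bar{\phi}}_k\big) \quad \text{for } m = 1, \ldots, N, \; m \ne k \qquad (4.19) $$

with a separate expression for the faulty processor, $m = k$
($1 \le m \le N$).

The estimate $\hat{y}_m$ of the non-faulty processor output is the sum of the
processor output $y_m$ and the “adjustment term” calculated from the
syndromes for reducing the numerical noise.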

Substituting these values into $L_k$ gives exactly the same result as the
former likelihood calculation but with different constants $\gamma_k$.
Converting $L_k$ into the relative likelihood $L'_k$, the constants are now
equal to

$$ \gamma'_k = -\log\left| \bar{\Phi}_k \Phi_k^{-1} \right| + 2\log\left(\frac{P_k}{1 - P_k}\right) \qquad (4.20) $$

Therefore, this modified likelihood ratio test works exactly like the former
likelihood ratio test and projection method but with different threshold
constants. However, it also enables us to estimate the working processors'
correct output values, as well as the failed processor's correct output
values.

Appendix E shows that the numerical noise in $\hat{y}_m$ has zero mean.
Intuitively, the variance of $\hat{y}_m$ should be reduced by a factor of
$O(C/(N+C))$, provided that there is no numerical noise involved in the
noise reduction process.

Although the fact that the correct processor output can be estimated is
theoretically interesting, it may not be very useful in many practical cases.
First, an $O(C/(N+C))$ reduction is not that significant, since we usually
want to keep $C$ small compared to $N$. Such a small improvement in noise
level may not justify the increase in the computational costs, especially if
the computation has to be done in Triple Modular Redundancy form.
Moreover, additional numerical error incurred during the noise reduction
computation may reduce the benefits even further. In cases when the
magnitudes of some of the weights are significantly larger than others (as
was the case when the checksum processor dynamic range was the same as that
of the data processors), the noise from some processors would be weighted
very heavily, and attempting to reduce the numerical noise using the
syndromes may actually increase the noise level in the outputs of the
processors with the small weights.
Chapter  5

Fixed Point System Simulations


5.1  Numerical  Noise 

In deriving the single fault correction methods in the presence of
numerical noise, the projection method and the generalized likelihood ratio
test yielded the same result. The relative log likelihood $L'_k$ in these
methods consists of the normalized energy in the projection of the syndrome
$s$ onto the $k$th weight vector with respect to the variance $V$, plus the
threshold constant $\gamma'_k$. Although the likelihood ratio test theory
suggests a specific threshold $\gamma'_k$ for each relative log likelihood,
these suggested values are not necessarily appropriate for all applications.
The major reason is that the probability distribution of the fault was not
known for the likelihood ratio test. We had to make an ad hoc modification to
the test so that the fault had a Gaussian distribution with a mean which had
to be estimated from the syndromes. Therefore, although the form of the log
likelihoods is intuitively reasonable, the specific threshold constants may
not be so accurate.

One also wants to be able to change the values of the thresholds to
make tradeoffs between the probabilities of different fault misdiagnoses.
For example, one would want to make a tradeoff between the probability
of fault detection and the probability of false alarm. This can be achieved
through appropriate changes in the $\gamma'_k$'s.
Another reason why the thresholds derived in the generalized likelihood
ratio test may not be appropriate is that the derivation assumes the
numerical noise to be Gaussian. Numerical noises are often non-Gaussian,
and the probability distribution of the numerical noises usually depends
on the processing tasks. Although we expect the numerical noise distribution
from the processors to look similar to a Gaussian distribution in most
cases, some processing tasks may generate numerical noise whose probability
distribution deviates significantly from Gaussian. Even for the cases in
which the noise distribution is close to Gaussian, the probability
distribution often differs substantially from Gaussian in the tails. For
example, the actual noise probability distribution is zero in the range
beyond the maximum possible numerical noise value. These deviations in the
tail ends can affect the probability of fault detection and the probability
of false alarm significantly. Because the thresholds are set to discriminate
between high noise and failure, the shape of the noise distribution in the
tails is crucial to setting an appropriate threshold for balancing the
various probabilities of misdiagnosis. These probability distribution
deviations can, however, be compensated for by appropriate choices of the
threshold $\gamma'_k$.
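The tail mismatch can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the noise is the sum of a fixed number of independent roundoff errors, each uniformly distributed over plus or minus half a least significant bit; this uniform model and the function name are assumptions, not results from the text. The true distribution of such a sum is exactly zero beyond its largest possible magnitude, while a Gaussian with the same variance still assigns a small non-zero probability there.

```python
import math


def gaussian_tail_beyond_roundoff_bound(n_roundoffs, lsb=1.0):
    """Probability that a zero-mean Gaussian with the same variance as a sum of
    n_roundoffs independent uniform roundoff errors exceeds, in magnitude,
    the largest value that sum can actually take."""
    sigma = lsb * math.sqrt(n_roundoffs / 12.0)   # each uniform error has variance lsb^2 / 12
    max_abs = n_roundoffs * lsb / 2.0             # all errors at +lsb/2 (or all at -lsb/2)
    return math.erfc(max_abs / (sigma * math.sqrt(2.0)))   # two-sided Gaussian tail


# Twelve roundoffs give unit variance and a hard bound of 6 LSB (a 6-sigma point).
tail = gaussian_tail_beyond_roundoff_bound(12)
```

A threshold designed from the Gaussian tail would therefore be more conservative than necessary in this region, which is the kind of deviation that adjusting the thresholds can compensate for.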

In this section we shall use simulations to study the workings of the
projection fault detection/correction method and the adjustments one needs
to make to the $\gamma'_k$ in real fixed point systems. First, we shall
simulate the projection method in an example fixed point single fault
correction system, assuming all the numerical noises are Gaussian. The
variables that are difficult to predict accurately, such as the probability
of false alarm and the numerical noise in the corrected processor output,
are observed. We also examine how the fault detection/correction algorithm
depends on various parameters such as the batch size, the processor
reliabilities, and the noise level.

Second, we model fixed point roundoff as a statistical process and
simulate the numerical noise probability distribution in the syndromes. We
shall show how the tail ends of such a probability distribution deviate from
Gaussian and show how $\gamma'_k$ can be adjusted to compensate for these
deviations.

Third, we shall examine the probability distribution of the numerical
noises in the syndromes of real systems by simulating a fixed point single
fault detection system. The purpose of this simulation is to see whether the
roundoff operations are indeed independent statistical processes, as was
assumed in the previous simulation. The fixed point Finite Impulse Response
Filter and Fast Fourier Transform were the computational tasks simulated.
We shall examine how the numerical noise probability distribution deviates
from theory in real systems, and how one can compensate by adjusting
$\gamma'_k$. Even though our study is limited to fixed point systems, similar
studies can be done on floating point systems as well.


5.2  Single Fault Correction System Simulation

In this section, we simulate a fixed point single fault correction system
using the projection method in order to observe the variables that are
difficult to predict analytically, such as the false alarm rate. In the first
simulation, we simulate a fixed point single fault correction system for a
large number of batches in order to collect an accurate histogram. As
predicted, the actual false alarm rate is less than $p(L'_k > 0 \mid H_0)$,
the probability that $L'_k$ exceeds zero, and the processor false alarm rate
depends on the size and the shape of the decision region. The numerical noise
filtering algorithm in section 2.2 was also applied in this simulation, and
we found that the noise reduction rate is very close to the predicted value.

In the second simulation, we simulated the same system using different batch sizes $q$. We found a slight increase in the false alarm rate with increasing $q$, which was not predicted analytically. However, this dependence on $q$ was found to be weak. The numerical noise in the corrected output in the case of a false alarm decreases with increasing $q$, as predicted analytically.

In the third simulation, we simulated the same system using different values of $p(L'_k > 0 \mid H_0)$ (i.e. different $\gamma'_k$). We found that the processor false alarm rate is closer to $p(L'_k > 0 \mid H_0)$ for smaller values of $p(L'_k > 0 \mid H_0)$. However, again this dependence was weak. The numerical noise in the corrected output in the case of a false alarm increases with decreasing $p(L'_k > 0 \mid H_0)$, as predicted analytically.

Therefore, we conclude that the false alarm rate $p(\text{Choose } H_k \mid H_0)$ is less than $p(L'_k > 0 \mid H_0)$ and that the difference depends on $q$ and on the value of $p(L'_k > 0 \mid H_0)$. However, these dependencies are weak, and the false alarm rate $p(\text{Choose } H_k \mid H_0)$ and $p(L'_k > 0 \mid H_0)$ are within an order of magnitude of each other. Therefore, it is good practice to use $p(L'_k > 0 \mid H_0)$ as the desired false alarm rate in designing systems.

5.2.1  Simulated  System 

The system simulated is a fixed point single fault correction system, based on a weighted checksum and the projection method. It consists of 10 data processors and 3 checksum processors ($N = 10$, $C = 3$) with the weighting matrix given in equation 2.31. All the numerical noises are assumed to be white Gaussian random variables. We shall first derive the appropriate values for $\gamma'_k$ and then simulate the projection fault detection/correction method to observe the variables that are difficult to predict accurately, such as the probability of false alarm and the numerical noise in the corrected processor output. We also examine how the fault detection/correction algorithm depends on various parameters such as the batch size, the processor reliabilities, and the noise level.

Since all the weights are -1, 0, and +1, there is no numerical noise involved in the input checksum calculation or in the syndrome calculation. This means that


Tk  =  oil 

At  =  0  (5.1) 

A*  =  Q 

Since $W W^T = \sigma_w^2 I$, according to section 3.4.1, the variance of the syndrome is equal to

$$V = \sigma_v^2 I \quad \text{where} \quad \sigma_v^2 = \sigma_w^2 \sigma_r^2 \qquad (5.2)$$

Now  the  relative  log  likelihood  can  be  rewritten  as 

$$L'_k = \frac{1}{r_{kk}\sigma_v^2} \sum_{l=N+1}^{N+C} \sum_{m=N+1}^{N+C} w_{l,k}\, w_{m,k}\, s_l^T s_m + \gamma'_k \qquad (5.3)$$

This  equation  can  also  be  written  as 


$$L'_k = \frac{1}{r_{kk}\sigma_v^2} \sum_{j=1}^{q} \Big( \sum_{m=N+1}^{N+C} w_{m,k}\, s_m[j] \Big)^2 + \gamma'_k \qquad (5.4)$$


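The expected value of the relative log likelihood under hypothesis $H_0$ is equal to

$$E(L'_k) = \frac{1}{r_{kk}\sigma_v^2} \sum_{j=1}^{q} E\Big[\Big( \sum_{m=N+1}^{N+C} w_{m,k}\, s_m[j] \Big)^2\Big] + \gamma'_k = q + \gamma'_k \qquad (5.5)$$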

which is consistent with the result in appendix D. The variance under $H_0$ can be calculated as

$$\mathrm{Var}(L'_k) = E(L'^2_k) - (E(L'_k))^2 = 2q \qquad (5.6)$$

The first term of equation 5.4 (everything except the term $\gamma'_k$) is equivalent to the sum of the squares of the $q$ Gaussian variables $\sum_m w_{m,k}\, s_m[j]$, each normalized by its variance. It is thus equivalent to the sum of the squares of $q$ unit variance Gaussian variables. The probability distribution function for the sum of the squares of $q$ unit variance Gaussian variables is the chi-square probability function with $q$ degrees of freedom.

The chi-square probability function is defined as follows. Suppose $X_1, X_2, \ldots, X_q$ are independent Gaussian random variables with zero mean and unit variance. Then $X^2 = \sum_{i=1}^{q} X_i^2$ is said to follow the chi-square distribution, and the probability that $X^2$ exceeds $\chi^2$ is given by

$$p(X^2 > \chi^2 \mid q) = \left[ 2^{q/2}\, \Gamma\!\left(\tfrac{q}{2}\right) \right]^{-1} \int_{\chi^2}^{\infty} t^{q/2 - 1} e^{-t/2}\, dt \qquad (5.7)$$

The expected value of $X^2$ is equal to $q$. For large $q$, the chi-square distribution can be approximated by a Gaussian distribution with mean $q$ and variance $2q$.

Since the probability distribution of $L'_k - \gamma'_k$ is equal to a chi-square distribution, we can choose $\gamma'_k$ in order to set $p(L'_k > 0 \mid H_0)$, the probability that $L'_k$ exceeds zero. The $\gamma'_k$ in this case is equal to the negative of the value $\chi^2$ needed to achieve the probability $p(X^2 > \chi^2 \mid q) = p(L'_k > 0 \mid H_0)$:

$$\gamma'_k = -\chi^2 \quad \text{for} \quad p(X^2 > \chi^2 \mid q) = p(L'_k > 0 \mid H_0) \qquad (5.8)$$

Values for $\gamma'_k$ can be calculated from chi-square distribution function tables.
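The threshold of equation 5.8 can also be evaluated numerically rather than read from tables. The following is a minimal sketch that assumes only the chi-square tail formula of equation 5.7; the function names `chi2_tail` and `gamma_threshold` are illustrative and are not taken from the original work.

```python
import math

def chi2_tail(x, q):
    """Upper tail p(X^2 > x | q) of a chi-square variable with q degrees of freedom."""
    if x <= 0.0:
        return 1.0
    a, z = q / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0
    total = 1.0
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    return 1.0 - lower

def gamma_threshold(q, p_exceed):
    """Return gamma'_k = -chi^2 such that p(L'_k > 0 | H_0) = p_exceed."""
    lo, hi = 0.0, q + 40.0 * math.sqrt(2.0 * q) + 40.0
    for _ in range(200):  # bisection: the tail probability decreases with x
        mid = 0.5 * (lo + hi)
        if chi2_tail(mid, q) > p_exceed:
            lo = mid
        else:
            hi = mid
    return -0.5 * (lo + hi)
```

For example, `gamma_threshold(10, 0.1)` is close to the tabulated chi-square value for 10 degrees of freedom, so that $q + \gamma'_k$ is close to -5.987.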

5.2.2  The  False  Alarm  Rate 

Although we can calculate and set $p(L'_k > 0 \mid H_0)$, the probability that the log likelihood exceeds zero under the no-fault condition, we cannot easily calculate $p(\text{Choose } H_k \mid H_0)$, the actual probability that a false alarm will occur in that processor. This is because the weight vectors are not orthogonal, the likelihoods are correlated with each other, and large numerical noise can cause more than one relative log likelihood to exceed zero. The probability of false alarm can be calculated by evaluating

$$p(\text{false alarm}) = \sum_{m=1}^{N+C} \int_{H_m} p(s \mid H_0)\, ds \qquad (5.9)$$

which expresses the probability that the syndrome under $H_0$ will fall in other decision regions. However, calculating this integral is non-trivial, and we have instead determined the probability of false alarm by simulation.

The  false  alarm  rate  is  an  important  design  parameter,  since  a  false 
alarm can cause a data processor output to be needlessly corrected, and
the  numerical  noise  in  the  corrected  output  is  most  likely  much  larger  than 
the  processor  numerical  noise.  There  is  also  a  tradeoff  between  the  false 
alarm  rate  and  the  size  of  the  fault  that  can  escape  detection. 

In these simulations, only the no-fault cases ($H_0$ cases) are simulated because we do not have a reliable model for processor failures. If a fault model were available, we could have simulated the probability of fault misdiagnosis. However, the probability of fault misdiagnosis is most likely much lower than the false alarm rate for the following reason. Only a small fault is likely to escape detection or to be misdiagnosed as some other processor's fault. Since the probability of a fault occurring is very low and the probability of a fault being small is most likely very low, the probability of a small fault occurring is very low indeed. Therefore, the probability of fault misdiagnosis is most likely much lower than the false alarm rate, and thus is not a very meaningful design parameter. Besides, the effect of the small fault in either misdiagnosis case is equivalent to an increase in the noise level in some of the processors, and is usually not fatal to the system application.

Instead of simulating a particular processing task, we assumed that all inputs are equal to zero ($x_k = 0$). Thus, all the processors generate signals $\bar{y}_k = 0$ with added white Gaussian noise with variance $\sigma_r^2$ as outputs. Each syndrome is equal to a weighted sum of the processor noises and is zero mean because of the no-fault condition. The white Gaussian noise was generated using pseudo random numbers. From the syndromes, the relative log likelihoods $L'_k$ were calculated. The largest relative likelihood was then found in order to classify the fault. If all were negative, then we concluded that there is no fault.
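The procedure just described can be sketched as a small Monte Carlo experiment. This is an illustration under stated assumptions, not the original simulator: the weight vectors below are a hypothetical choice with entries in {-1, 0, +1} (the actual matrix is the one in equation 2.31), the value of `GAMMA` is the chi-square table entry for $q = 10$ and $p = 0.1$, and the function name `simulate` is made up for this sketch.

```python
import random

# Hypothetical weight vectors (assumption): 10 data and 3 checksum processors.
DATA_W = [(1, 1, 0), (1, -1, 0), (1, 0, 1), (1, 0, -1), (0, 1, 1), (0, 1, -1),
          (1, 1, 1), (1, 1, -1), (1, -1, 1), (1, -1, -1)]
CHECK_W = [(-1, 0, 0), (0, -1, 0), (0, 0, -1)]
W = DATA_W + CHECK_W
GAMMA = -15.987  # threshold constant for q = 10, p(L'_k > 0 | H_0) = 0.1

def simulate(trials, q=10, sigma_r=10.0, seed=1):
    """Return (exceed, chosen): counts of L'_k > 0 and of each decision under H_0."""
    rng = random.Random(seed)
    sigma_v2 = 9.0 * sigma_r ** 2        # W W^T = 9 I for these weights
    exceed = [0] * len(W)
    chosen = [0] * (len(W) + 1)          # index 0 counts "no fault" decisions
    for _ in range(trials):
        energy = [0.0] * len(W)
        for _ in range(q):
            noise = [rng.gauss(0.0, sigma_r) for _ in W]   # all inputs are zero
            s = [sum(w[i] * e for w, e in zip(W, noise)) for i in range(3)]
            for k, w in enumerate(W):
                proj = sum(w[i] * s[i] for i in range(3))
                energy[k] += proj * proj
        L = [energy[k] / (sum(c * c for c in W[k]) * sigma_v2) + GAMMA
             for k in range(len(W))]
        for k, value in enumerate(L):
            exceed[k] += value > 0
        best = max(range(len(W)), key=lambda k: L[k])
        chosen[best + 1 if L[best] > 0 else 0] += 1
    return exceed, chosen
```

Because processor $k$ is chosen only when $L'_k$ is both positive and the largest, the false alarm count for each processor can never exceed its count of $L'_k > 0$ events.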

In these experiments, the thresholds are set such that the probabilities $p(L'_k > 0 \mid H_0)$ are fairly high (around the 0.01 to 0.1 range rather
than  the  more  realistic  10“^°  to  10“^°  range)  so  that  we  can  get  meaningful 
histograms without doing the simulation for an enormous number of repetitions. All $p(L'_k > 0 \mid H_0)$ were set to an identical value in all processors. Therefore, all $\gamma'_k$ were set to an identical value as well.

Table 5.1 shows the simulation result histogram for 500,000 trials with $\sigma_r = 10$ lsb, $q = 10$ and $p(L'_k > 0 \mid H_0) = 0.1$. As one can see, $\#(L'_k > 0 \mid H_0)$, the number of times that each relative likelihood exceeds zero, is very close to the predicted value 50,000. This also corresponds to the fact that $\langle L'_k \rangle$, the average value of the relative likelihood, is very close to -5.987, as predicted by equation 5.5. However, $\#(\text{Choose } H_k \mid H_0)$, the number of false alarms on processor $k$, is around 17,305 for $k = 1, \ldots, 6$, around 16,334 for $k = 7, \ldots, N$, and around 20,339 for $k = N+1, \ldots, N+C$.

   k    #(Choose H_k | H_0)    #(L'_k > 0 | H_0)      <L'_k>
   0         269,820                    0             0.000000
   1          17,049               49,933            -5.974040
   2          17,379               50,172            -5.982454
   3          17,467               50,513            -5.976576
   4          17,444               50,595            -5.966286
   5          17,202               50,158            -5.984101
   6          17,286               50,366            -5.977914
   7          16,383               50,002            -5.979583
   8          16,327               50,438            -5.968598
   9          16,267               50,218            -5.978332
  10          16,360               50,345            -5.981068
  11          20,163               50,181            -5.987824
  12          20,348               50,285            -5.968671
  13          20,505               50,426            -5.974191

Table 5.1: Simulation Histogram of Single Fault Correction (N=10, C=3)

The reason is that the weight vectors have different sets of neighboring weight vectors and differently shaped decision regions. By neighbors we mean abutting decision regions. (The fault location algorithm is most likely to confuse faults in neighboring decision regions.) When the angles between neighboring weight vectors are small, the decision region is "skinny", and has a smaller probability of correctly diagnosing the failure and a smaller likelihood of picking this vector when $H_0$ is true. Therefore, one would expect that the closer the neighboring weight vectors are, the greater the difference between the actual false alarm rate $p(\text{Choose } H_k \mid H_0)$ and the probability $p(L'_k > 0 \mid H_0)$. Table 5.2 has the angles between the neighboring weight vectors. The $\theta_{k,m}$, the angle between weight vectors $k$
and $m$, is defined as

$$\theta_{k,m} = \cos^{-1}\!\left( \frac{w_k^T w_m}{\|w_k\|\,\|w_m\|} \right)$$

Weight vectors $k = 1, \ldots, 6$ each have two weights of ±1 and one zero, and have two neighboring vectors at 35.3 degrees and two neighboring vectors at 45 degrees. The weight vectors for $k = 7, \ldots, N$ each have three ±1 coefficients, with three neighboring vectors at 35.3 degrees. The weight vectors for $k = N+1, \ldots, N+C$ (the checksum processors) each have only one -1 coefficient and two zero coefficients, and have four nearest neighbors at 45 degrees. Thus if all threshold constants are equal, then decision regions $H_{N+1}, \ldots, H_{N+C}$ are the "fattest" and the most likely to be confused with $H_0$; regions $H_1, \ldots, H_6$ are next in size and have medium false alarm rates, while regions $H_7, \ldots, H_N$ are the skinniest and have the lowest false alarm rate. This is despite the fact that $\#(L'_k > 0 \mid H_0)$ is the same for all processors.
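The neighbor angles quoted above can be checked with a few lines of code. The sketch below assumes a hypothetical set of weight vectors with entries in {-1, 0, +1} (the signs are illustrative; the actual weights are those of equation 2.31) and measures the angle between the lines spanned by two vectors, so that a sign change does not matter. The names `angle_deg` and `neighbor_counts` are made up for this example.

```python
import math

# Hypothetical weight vectors (assumption): k = 1..6, 7..10, 11..13 in order.
W = [(1, 1, 0), (1, -1, 0), (1, 0, 1), (1, 0, -1), (0, 1, 1), (0, 1, -1),
     (1, 1, 1), (1, 1, -1), (1, -1, 1), (1, -1, -1),
     (-1, 0, 0), (0, -1, 0), (0, 0, -1)]

def angle_deg(u, v):
    """Angle in degrees between the lines spanned by u and v."""
    dot = abs(sum(a * b for a, b in zip(u, v)))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return math.degrees(math.acos(min(1.0, dot / norm)))

def neighbor_counts(k, tol=0.1):
    """Number of other weight vectors at about 35.3 and at 45 degrees from vector k."""
    angles = [angle_deg(W[k], W[m]) for m in range(len(W)) if m != k]
    near_35 = sum(abs(a - 35.26) < tol for a in angles)
    near_45 = sum(abs(a - 45.0) < tol for a in angles)
    return near_35, near_45
```

With this choice of weights, `neighbor_counts` reproduces the pattern described in the text: two and two for the first group, three at 35.3 degrees for the second group, and four at 45 degrees for the checksum processors.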
   k               neighbors at 35.3 degrees    neighbors at 45 degrees
   1, ..., 6                   2                           2
   7, ..., 10                  3                           0
  11, ..., 13                  0                           4

Table 5.2: Angle Between Neighboring Weight Vectors

While we were doing the simulation for the false alarm rate, we also attempted to use the syndromes to filter some of the data processors' numerical noise. We assumed the no-fault condition $H_0$ for all cases, and calculated the data processor's corrected output $\hat{y}_k$ as

$$\hat{y}_k = y_k - \frac{\sigma_r^2}{\sigma_v^2} \sum_{m=N+1}^{N+C} w_{m,k}\, s_m \quad \text{for } k = 1, \ldots, N \qquad (5.10)$$

The observed ratio between the corrected output noise variance $\mathrm{Var}(\hat{y}_k - \bar{y}_k)$, where $\bar{y}_k = 0$, and the original noise variance $\mathrm{Var}(y_k)$ is equal to

$$\frac{\mathrm{Var}(\hat{y}_k - \bar{y}_k)}{\mathrm{Var}(y_k)} = 0.73416 \qquad (5.11)$$

This is close to the predicted value $N/(N+C) = 10/13 \approx 0.76923$. The noise reduction with this method is marginal and may not justify the extra computation (adding just one bit to the processor registers would reduce the noise variance to 0.25 of its original value).
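The size of this noise reduction can be estimated without simulation. The sketch below assumes that the filter is the least-squares projection $\hat{y}_k = y_k - (1/\sigma_w^2) \sum_m w_{m,k} s_m$ with $W W^T = \sigma_w^2 I$, and it uses a hypothetical set of weight vectors with entries in {-1, 0, +1}; both are assumptions made for illustration, and the name `residual_ratio` is made up.

```python
# Hypothetical weight vectors (assumption): 10 data and 3 checksum processors.
DATA_W = [(1, 1, 0), (1, -1, 0), (1, 0, 1), (1, 0, -1), (0, 1, 1), (0, 1, -1),
          (1, 1, 1), (1, 1, -1), (1, -1, 1), (1, -1, -1)]
CHECK_W = [(-1, 0, 0), (0, -1, 0), (0, 0, -1)]
W = DATA_W + CHECK_W
SIGMA_W2 = 9.0  # W W^T = 9 I for this choice of weights

def residual_ratio(k):
    """Var(filtered output noise of processor k) / Var(original noise), white noise assumed."""
    total = 0.0
    for m in range(len(W)):
        r_km = sum(a * b for a, b in zip(W[k], W[m]))
        gain = (1.0 if m == k else 0.0) - r_km / SIGMA_W2  # weight of noise m in output k
        total += gain * gain
    return total

data_avg = sum(residual_ratio(k) for k in range(10)) / 10
all_avg = sum(residual_ratio(k) for k in range(13)) / 13
```

Under these assumptions the average over the data processors is 11/15 (about 0.733), while the average over all $N + C$ processors is 10/13 (about 0.769), which may explain why the observed ratio falls slightly below $N/(N+C)$.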
5.2.3  Dependence on q

This simulation examines the dependency of the projection method on the output batch size $q$. Table 5.3 shows a set of simulations with five different values of $q$. For each $q$ the histogram was compiled from 10,000 simulations. The thresholds were set so that the probability that each relative likelihood exceeds zero is $p(L'_k > 0 \mid H_0) = 0.1$. The processor noise standard deviation was set to $\sigma_r = 10$ lsb. Note that $\gamma'_k$ does not scale linearly with $q$. The Gaussian approximation of the chi-square distribution indicates that for large $q$, the mean of $L'_k$ approaches $q + \gamma'_k$ and the standard deviation approaches $\sqrt{2q}$. Therefore, $\gamma'_k$ scales approximately as $-q - \tau\sqrt{2q}$, where $\tau$ is fixed by the desired $p(L'_k > 0 \mid H_0)$.
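As a short worked step, the scaling follows from the Gaussian approximation. Here $Q(\cdot)$ denotes the standard Gaussian tail function, a symbol introduced only for this illustration, and 1.2816 is the standard Gaussian point with upper tail probability 0.1.

```latex
\begin{align*}
p(L'_k > 0 \mid H_0) &\approx Q\!\left( \frac{-(q + \gamma'_k)}{\sqrt{2q}} \right) = Q(\tau)
  \quad\Longrightarrow\quad \gamma'_k \approx -q - \tau\sqrt{2q}, \\
q = 10,\ p = 0.1: \quad \tau &\approx 1.2816, \qquad
  \gamma'_k \approx -10 - 1.2816\sqrt{20} \approx -15.73 .
\end{align*}
```

This is within about 2% of the chi-square table value of -15.987 for $q = 10$, and the approximation improves as $q$ grows.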

One can see that $\langle \#(L'_k > 0 \mid H_0) \rangle$, the average number of times that each relative likelihood is greater than zero, is again close to the predicted value 1,000 for all $k$. However, $\langle \#(\text{Choose } H_k \mid H_0) \rangle$, the average number of false alarms on processor $k$, increases slightly as $q$ increases for all $k$. This dependency on $q$ was not predicted by our likelihood ratio test. Although we do not have an exact explanation as to why this happens, it is most likely because the shape of the probability distribution of the relative likelihoods changes with increasing $q$, dropping the tail of the distribution curve until it looks Gaussian. The false alarm is caused by events in the tail of the likelihood distribution curve. However, using different threshold values for different $q$ results in a chi-square distribution whose tail shape changes with $q$. The integral of $p(s \mid H_0)$ over decision regions $H_k$ will be a complicated function of $q$. The increasing false alarm rate with increasing $q$ also suggests that the likelihoods become less correlated with increasing $q$.

However, this dependency on $q$ is weak in nature. Since the probability of processor failure is estimated at best in orders of magnitude, this weak dependency on $q$ does not greatly affect the application of our projection method.

Another variable dependent on $q$ is $\sigma_{err}^2$, the noise variance of the corrected output of the processor which has been misdiagnosed to be faulty. In case there is a false alarm due to excessive numerical noise, the output of the processor which has been diagnosed to be faulty is corrected. The $\sigma_{err}^2$ is defined as

$$\sigma_{err}^2 = \mathrm{Var}(\hat{y}_k - \bar{y}_k) \qquad (5.13)$$
where

$$\hat{y}_k = y_k - \hat{\phi}_k \qquad (5.14)$$

and $\bar{y}_k$ is the correct value of $y_k$ without any numerical noise. In our simulation, since the correct value of all the processors is set to zero, the variance of the corrected output which has been misdiagnosed to be faulty is equal to $\mathrm{Var}(\hat{y}_k)$.

When there is a false alarm, the fault estimate $\hat{\phi}_k$ is equal to the numerical noises from all the processors projected onto the weight vector $k$.


.  N+C 


Ikfc  II 


(5.15) 


The output of the processor with the false alarm is corrected by subtracting this fault estimate. Therefore, the corrected output has its own processor noise subtracted out, but now contains the projection of the numerical noises from all other processors with non-orthogonal weight vectors.


JV+C 


m  ^  k 


-  E 


(5.16) 


Therefore, if we assume that the processor noise is white with variance $\sigma_r^2$, the ratio $\sigma_{err}^2 / \sigma_r^2$ is equal to


.v+c 

E 

ni  =  1 
m  /  ^ 


myim 


3.33  for  A:  =  l,2,...,^r  (5.17) 


However, in case of a false alarm, the processor noise variance would most likely be higher than $\sigma_r^2$. This is because a higher than normal noise level is needed to cause a false alarm. Therefore, the numerical noise level in the corrected output would also be higher than the value computed above.

Let us examine what happens to the noise level in the corrected output as $q$ increases. Using the Gaussian approximation, as $q$ increases, the mean value of the relative log likelihood increases as $q$, and the standard deviation increases as $\sqrt{q}$. Therefore, the normalized threshold $-\gamma'_k / q$ decreases approximately as $O(1 + \sqrt{2}\,\tau / \sqrt{q})$. Such a decrease in the normalized threshold increases the detection sensitivity to smaller permanent faults. This also causes the average system noise level needed to set off a false alarm to decrease with increasing $q$. The decreasing syndrome noise level with increasing $q$ in the false alarm case is reflected in $\sigma_{err}^2 / \sigma_r^2$, as shown in table 5.3. Note that this is for false alarm cases only. In the event of a real fault, the noise level in the system will most likely be at an average level. Therefore, the numerical noise in the corrected output would be significantly less, around $\sigma_{err}^2 / \sigma_r^2 \approx 3.33$.

  q                                        1       3      10      30     100
  <#(L'_k > 0 | H_0, k <= 6)>          1,029   1,010     997   1,005   1,007
  <#(L'_k > 0 | H_0, 6 < k <= N)>        989   1,020   1,007     996   1,025
  <#(L'_k > 0 | H_0, k > N)>             999     996   1,003   1,018     999
  <#(Choose H_k | H_0, k <= 6)>          312     334     337     363     372
  <#(Choose H_k | H_0, 6 < k <= N)>      268     300     327     324     357
  <#(Choose H_k | H_0, k > N)>           365     385     403     420     419
  #(Choose H_0 | H_0)                  5,960   5,641   5,465   5,264   5,088
  sigma_err^2 / sigma_r^2               11.5    7.53    5.22    4.21    3.57

Table 5.3: Simulation with Varying q

Also note that the probability of a fault occurring increases approximately linearly with the processing time spent on the batch. Therefore, for any given calculation, $p(L'_k > 0 \mid H_0)$ should be increased approximately linearly with $q$ for doing the same task (see equation 4.11). This results in a more rapidly decreasing $\sigma_{err}^2 / \sigma_r^2$ with increasing $q$ until $\sigma_{err}^2 / \sigma_r^2 \approx 3.33$. The batch size $q$ used in the fault detection/correction does not necessarily have to be equal to the computation batch size. The fault detection/correction batch size $q$ would most likely be determined by the tradeoff between the fault detection sensitivity to a small transient fault and the fault detection sensitivity to a small permanent fault.
  p(L'_k > 0 | H_0)                     0.25     0.1    0.05   0.025    0.01
  #(L'_k > 0 | H_0, theory)            2,500   1,000     500     250     100
  <#(L'_k > 0 | H_0, k <= 6)>          2,478     997     492     249     101
  <#(L'_k > 0 | H_0, 6 < k <= N)>      2,480   1,007     512     249     102
  <#(L'_k > 0 | H_0, k > N)>           2,484   1,003     500     248      99
  <#(Choose H_k | H_0, k <= 6)>          564     337     212     128      59
  <#(Choose H_k | H_0, 6 < k <= N)>      540     327     202     115      55
  <#(Choose H_k | H_0, k > N)>           680     403     250     142      65
  #(Choose H_0 | H_0)                  2,420   5,465   7,170   8,350   9,231
  sigma_err^2 / sigma_r^2               4.70    5.22    5.61    6.06    6.61

Table 5.4: Simulation with Varying p(L'_k > 0 | H_0)

5.2.4  Dependence  on  Changing  False  Alarm  Rate 

This simulation examines the system behavior with changing processor reliability. Table 5.4 shows the set of simulations with five different values of $p(L'_k > 0 \mid H_0)$. For each value, the simulation histogram was collected from 10,000 simulation trials. The processor noise standard deviation was again set to $\sigma_r = 10$ lsb. The $\langle \#(L'_k > 0 \mid H_0) \rangle$, the average number of times that each relative likelihood exceeds zero, is again close to the theoretically predicted value for all $k$. However, $\langle \#(\text{Choose } H_k \mid H_0) \rangle$, the average number of false alarms in processor $k$, is much less for all $k$, especially for high values of $p(L'_k > 0 \mid H_0)$. This is because the high values of $p(L'_k > 0 \mid H_0)$ are unrealistically high, with $\sum_{k=1}^{N+C} p(L'_k > 0 \mid H_0)$ exceeding 1. For more realistic values of $p(L'_k > 0 \mid H_0)$, the $\langle \#(\text{Choose } H_k \mid H_0) \rangle$ would still be less than $\langle \#(L'_k > 0 \mid H_0) \rangle$, but closer to it. However, it is difficult to simulate with realistic failure rates and thresholds since it requires too many trials. (The simulator itself will probably develop a fault before the result is complete.) This simulation shows that for realistic values of $p(L'_k > 0 \mid H_0)$, $\langle \#(\text{Choose } H_k \mid H_0) \rangle$ would be within an order of magnitude of $\langle \#(L'_k > 0 \mid H_0) \rangle$, which is all we need to know, since the processor failure rate itself can at best be predicted only within an order of magnitude.

Notice that $\sigma_{err}^2$, the noise variance of the corrected output of the processor in false alarm cases, increases with decreasing $p(L'_k > 0 \mid H_0)$. This is because the magnitude of $\gamma'_k$ increases, and the system noise level needed to set off a false alarm also increases.

Figure 5.1: Error Distribution for One Roundoff


5.3  Probability Distribution of the Numerical Noise

When the numerical noises are white Gaussian random variables, we can easily set the thresholds using a chi-square function. However, the numerical noise in real systems is not exactly white Gaussian. The object of this section is to model the fixed point roundoff operations as independent statistical processes and to find out how closely the probability distribution of the fixed point roundoff noise based on this model matches a Gaussian distribution. We also examine how one can compensate for the difference by adjusting $\gamma'_k$. Let us assume that each roundoff operation generates an independent roundoff error which has a uniform probability distribution between -1/2 lsb and +1/2 lsb, where lsb stands for the least significant bit. This distribution is shown in figure 5.1. In fixed point arithmetic, addition and subtraction are exact and only multiplication introduces roundoff noise.
The simple example chosen in this section deals with the probability distribution of the numerical noise in the syndrome. Suppose that each processor uses $m$ successive roundings. Suppose further that we use a single fault detection system with one checksum processor whose weights are all equal to 1. Then the total number of roundings, $M$, contributing quantization noise to the syndrome is equal to

$$M = (N + 1)\, m \qquad (5.18)$$

Such a situation can be imagined when a fixed point FIR filter of length $m$ is implemented in such a way that $m$ input points are first multiplied by $m$ coefficients, rounded, and then summed to produce the output. In real systems, this may not be a desirable way to perform FIR filtering. The preferred method would be not to round after each multiplication, but to sum the non-rounded products and to perform one rounding operation at the end. One would need an accumulator with twice the number of bits to implement the FIR filter this way, which may not always be possible. In this case $m$ would be equal to 1.

Assuming that all rounding steps are independent, the probability distribution of the syndrome numerical noise is equal to the uniform probability distribution of a single rounding step convolved with itself $M$ times. As $M$ approaches infinity, the probability distribution of the numerical noise approaches a Gaussian distribution. However, when $M$ is finite, the probability distribution of the numerical noise deviates from a Gaussian distribution, especially at the tail ends of the distribution. For example, when $M = 2$, the noise distribution is triangular, peaking at the origin and becoming zero at +1 lsb and at -1 lsb. The standard deviation of the distribution is equal to $1/\sqrt{6}$ lsb. Therefore, the tail ends of the distribution are equal to zero beyond about 2.5 times the standard deviation from the origin. This changes our fault detection strategy greatly. Since a false alarm is caused by events in the tail, such a drastic change in the shape of the tail causes the threshold to be modified greatly. For example, any syndrome beyond the maximum noise range should now be considered a fault detection ($\gamma'_k = -1$ lsb). As $M$ gets larger, the maximum noise range increases and the distribution curve looks more like a Gaussian. However, the tails of the curve are still significantly below Gaussian and one must make adjustments in the $\gamma'_k$.
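As a worked check of the $M = 2$ case, using only the uniform roundoff model described above (each rounding error has variance 1/12 lsb squared):

```latex
\begin{align*}
\sigma^2 &= 2 \cdot \tfrac{1}{12} = \tfrac{1}{6}\ \text{lsb}^2,
  \qquad \sigma = \tfrac{1}{\sqrt{6}} \approx 0.408\ \text{lsb}, \\
\frac{1\ \text{lsb}}{\sigma} &= \sqrt{6} \approx 2.45 .
\end{align*}
```

The triangular density therefore vanishes beyond about 2.45 standard deviations, whereas a Gaussian density with the same variance would still place roughly 1.4% of its probability outside that range.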
Since it is difficult to calculate the distribution curve for $M$ larger than 2 in a closed form, we have used a numerical simulation in order to find out how closely the distribution curve matches the Gaussian curve for different values of $M$. The simulation of the convolution is carried out for values of $M$ that are powers of 2. The initial uniform distribution ($M = 1$) between -1/2 lsb and +1/2 lsb is represented by 2048 data points. The simulation is carried out by first convolving the uniform distribution curve with itself. This gives the numerical noise probability distribution for $M = 2$. The result is then decimated by a factor of 2 in order to keep the number of data points from growing exponentially, and the result is convolved with itself again to produce the noise probability distribution for $M = 4$. This decimation and convolution process is repeated for the desired number of repetitions.
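The convolve-and-decimate procedure can be sketched in a few lines. This is an illustrative reimplementation, not the original program: it uses 257 grid points instead of 2048 so that a direct pure-Python convolution stays fast, it renormalizes the density after each decimation, and the names `roundoff_density` and `variance` are made up for this example.

```python
def roundoff_density(num_doublings, n=257):
    """Density of the sum of M = 2**num_doublings independent uniform roundoff errors.

    Returns (spacing, values): n samples on a symmetric grid centered at zero (n odd).
    """
    spacing = 1.0 / n
    values = [1.0] * n                       # uniform on (-1/2, +1/2) lsb, M = 1
    for _ in range(num_doublings):
        full = [0.0] * (2 * n - 1)
        for i, a in enumerate(values):       # direct self-convolution
            for j, b in enumerate(values):
                full[i + j] += a * b * spacing
        values = full[::2]                   # decimate by 2 to keep n points
        spacing *= 2.0
        area = sum(values) * spacing
        values = [v / area for v in values]  # renormalize to unit area
    return spacing, values

def variance(spacing, values):
    """Variance of a sampled density on a symmetric grid."""
    center = (len(values) - 1) / 2.0
    return sum(v * ((i - center) * spacing) ** 2
               for i, v in enumerate(values)) * spacing
```

For instance, `variance(*roundoff_density(3))` should be close to $8/12$, and the curve for one doubling is the triangular density with a peak of about 1 at the origin.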

The results of this simulation are shown in figure 5.2 and are plotted against a Gaussian distribution curve. The abscissa of the curve is normalized by the standard deviation $\sigma$, where the variance $\sigma^2$ is equal to

$$\sigma^2 = M/12 \qquad (5.19)$$

When $M$ is low, the tail ends of the noise distribution are much less than Gaussian. As $M$ increases, the noise distribution gets closer to the Gaussian distribution over wider and wider intervals around the origin. Normally, we set $\gamma'_k$ so that $p(L'_k > 0 \mid H_0)$ has a fixed value. If the tails of the syndrome noise probability distribution are not Gaussian and fall off more rapidly, then we need to use a smaller $\gamma'_k$ than in the Gaussian case. Since the tails of the noise distribution are closer to Gaussian for larger $M$, the adjustment one has to make on $\gamma'_k$ also decreases with $M$.

The exact value of $\gamma'_k$ depends on the exact shape of the tail. For example, when $q = 1$, the probability that $L'_k$ exceeds zero is equal to the integral of the syndrome noise distribution curve beyond $\pm\sqrt{-\gamma'_k}$. When $q > 1$, the exact value of $\gamma'_k$ can be calculated from the probability distribution of the relative likelihoods.


Figure 5.2: Noise Distribution of M Roundoff Operations

5.4  Numerical Noise Histograms of Real Applications

The previous section assumed that the fixed point multiprocessor system's syndrome noise is the result of independent rounding noises that are summed together. The purpose of this simulation is to see whether the fixed point roundoff operations are indeed independent statistical processes. We are interested in seeing whether the syndrome noise probability distributions of real systems match the distribution curve of the previous simulation. Specifically, we simulated a single fault detection system with ten data processors and one checksum processor ($N = 10$ and $C = 1$). All the data processor weights are equal to one ($w_{N+1,k} = 1$) and the checksum processor weight is equal to minus one ($w_{N+1,N+1} = -1$). The simulated tasks are a Finite Impulse Response Filter and a Fast Fourier Transform with random inputs as well as with sinusoidal inputs. If all the roundoff operations were random and uncorrelated, the numerical noise in the syndrome should have the probability distribution discussed in the previous section. It should be very close to Gaussian around the origin, and drop below Gaussian in the tail ends.

Figure 5.3 shows the syndrome histogram when all the processors are working correctly and when the processing task is a 10 point FIR filter. In each processor, the input is first multiplied by the coefficients, the results are rounded, and then summed to produce the output. There is one roundoff operation associated with each multiplication. That means the syndrome contains numerical noise which is the sum of $M = (N + 1) \times q = 110$ roundoff operations. Filter coefficients are chosen randomly, and the results are calculated for both random input and sinusoidal input. The results match the Gaussian curve very closely. Although the previous simulation indicates that the histogram should be much less than Gaussian near the tail, it would take an impractically large number of repetitions of the simulation to be able to see the tail ends tapering off from the Gaussian distribution.
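A small experiment along these lines can be sketched as follows. It is an illustration under stated assumptions rather than the original simulation: inputs are random integers (in units of one lsb) bounded by a hypothetical `x_max`, coefficients are random values in (-1, 1), every product is rounded to the nearest lsb, and the names `fir_output`, `syndrome` and `syndrome_variance` are made up for this example.

```python
import math
import random

def fir_output(coeffs, x):
    """One fixed point FIR output: each product is rounded, then the results are summed."""
    return sum(math.floor(c * xi + 0.5) for c, xi in zip(coeffs, x))

def syndrome(coeffs, inputs):
    """Data outputs (weight +1) minus the checksum processor output (weight -1)."""
    checksum_input = [sum(column) for column in zip(*inputs)]   # exact integer sums
    return sum(fir_output(coeffs, x) for x in inputs) - fir_output(coeffs, checksum_input)

def syndrome_variance(trials, n_proc=10, taps=10, x_max=2 ** 15, seed=1):
    """Sample variance of the no-fault syndrome over random inputs."""
    rng = random.Random(seed)
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(taps)]
    samples = []
    for _ in range(trials):
        inputs = [[rng.randint(-x_max, x_max) for _ in range(taps)]
                  for _ in range(n_proc)]
        samples.append(syndrome(coeffs, inputs))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / (trials - 1)
```

If the 110 roundoff errors behave as independent uniform variables, the sample variance should be near $M/12 = 110/12 \approx 9.2$ lsb squared, consistent with equation 5.19.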

Figure  5.4  and  figure  5.5  show  the  syndrome  histograms  for  fixed  point 
FFT  operations  in  a  single  fault  detection  system  with  all  processors  work¬ 
ing  correctly.  The  algorithm  chosen  for  the  FFT  is  the  straightforward 
radix  two  butterfly  algorithm  [Oppenheim  75].  There  is  no  scaling  by  a 
factor of two between the stages of the butterfly. (If there were scaling by
two in each stage, the theoretical signal to noise ratio based on the
independent roundoff assumption would have been approximately $q/4$ times higher
[Oppenheim 75].) There is a roundoff operation associated with each
multiplication. Therefore, each complex multiplication introduces two roundoff
operations  into  each  of  the  real  and  the  imaginary  parts  of  the  output.  The 
real  and  imaginary  parts  of  the  syndrome  noise  are  treated  as  separate  nu¬ 
merical  noises  in  the  histogram.  Both  64  point  and  1024  point  FFT’s  are 
simulated  for  random  input  and  sinusoidal  input. 

These histograms differ significantly from the Gaussian distribution,
peaking around the origin and falling much below the Gaussian near the tails.
This result is consistent with Welch's work [Welch 69], which found that the
numerical noise variance is much less than predicted by theory based on
randomness of the roundoff operations. Therefore, the likelihood ratio
constant $\gamma'_k$ associated with the FFT processing systems would be
significantly less than the $\gamma'_k$ associated with the Gaussian noise case.

These simulations indicate that the numerical noise distribution profile
depends heavily on the application and is not necessarily close to the
Gaussian profile. Therefore, the $\gamma'_k$ in the likelihood ratio test have
to be adjusted appropriately for each application. The exact value of the
$\gamma'_k$ can be determined from the relative log likelihood probability
distribution under the no-fault condition $H_0$. The integral of the
probability distribution over the range $L'_k > 0$ should be set to the desired
false alarm rate. In a case such as the FFT case, the $\gamma'_k$ would be
significantly less than in the Gaussian case, due to the much faster falling tails.
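
When the no-fault distribution is only available as simulated samples, the adjustment described above can be done empirically. The following is a minimal sketch under that assumption; the function names and the idea of working with a "score" (the likelihood statistic before the constant is added) are illustrative choices, not part of the original method description.

```python
def calibrate_constant(no_fault_scores, false_alarm_rate):
    """Pick an additive constant so that (score + constant > 0) occurs at the
    requested rate among the no-fault samples."""
    ranked = sorted(no_fault_scores, reverse=True)
    allowed = int(round(false_alarm_rate * len(ranked)))  # alarms we tolerate
    if allowed >= len(ranked):
        raise ValueError("false alarm rate too high for this sample size")
    # Shift the scores so that the first score that must NOT alarm sits at zero.
    return -ranked[allowed]


def empirical_false_alarm_rate(no_fault_scores, constant):
    """Fraction of no-fault samples that would be declared faulty."""
    alarms = sum(1 for score in no_fault_scores if score + constant > 0)
    return alarms / len(no_fault_scores)
```

A distribution with faster falling tails, such as the FFT case, yields a smaller calibrated constant for the same false alarm rate than a Gaussian of equal variance.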
Figure 5.3: Noise Histogram of FIR Single Fault Detection System

Figure 5.4: Noise Histogram of 64 Point FFT Single Fault Detection System

Figure 5.5: Noise Histogram of 1000 Point FFT Single Fault Detection System
Chapter 6

Multiple Fault Correction
Systems with Numerical Noise


6.1  Projection  Method 

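6.1.1  $K_m$ Fault Correction

The projection method for $K_m$ fault correction involves the projection of the
syndrome $s$ onto the hyperplanes belonging to the different failure hypotheses.
Let $H_{k_1 \ldots k_K}$ represent the hypothesis that the $K$ processors
$k_1, \ldots, k_K$ have failed, where $0 \le K \le K_m$. For each $K$ there are
$(N+C)!/((N+C-K)!\,K!)$ different failure hypotheses. For each hypothesis, the
syndrome is projected onto the corresponding hyperplane by minimizing

$$\min_{\phi_{k_1}, \ldots, \phi_{k_K}} \Big\| s + \sum_{i=1}^{K} W_{k_i} \phi_{k_i} \Big\|^2_{V^{-1}} \tag{6.1}$$

over the possible failure values. Solving for the minimizing values gives

$$\begin{bmatrix} \hat\phi_{k_1} \\ \vdots \\ \hat\phi_{k_K} \end{bmatrix} = -\left( \begin{bmatrix} W_{k_1}^T \\ \vdots \\ W_{k_K}^T \end{bmatrix} V^{-1} \begin{bmatrix} W_{k_1} & \cdots & W_{k_K} \end{bmatrix} \right)^{-1} \begin{bmatrix} W_{k_1}^T \\ \vdots \\ W_{k_K}^T \end{bmatrix} V^{-1} s \tag{6.2}$$
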

We can carry out the threshold test for the normalized projection energy
in each failure hypothesis with $0 < K \le K_m$ by calculating

$$L'_{k_1 \ldots k_K} = \Big\| \sum_{i=1}^{K} W_{k_i} \hat\phi_{k_i} \Big\|^2_{V^{-1}} + \gamma'_{k_1 \ldots k_K} \tag{6.3}$$

for each projection, where $\gamma'_{k_1 \ldots k_K}$ is a constant. If one of the
$L'_{k_1 \ldots k_K}$ exceeds zero, then the processors $k_1, \ldots, k_K$ are
chosen to be faulty. If more than one exceed zero, then the largest is chosen
to be the failure hypothesis.

Once the processors $k_1, \ldots, k_K$ are estimated as being faulty, the correct
values of the processor outputs are calculated by

$$\hat y_{k_i} = y_{k_i} - \hat\phi_{k_i}, \qquad i = 1, \ldots, K \tag{6.4}$$

for the data processors.
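
The projection search can be illustrated with exact rational arithmetic. The sketch below is a simplified stand-in for the thresholded test, under several assumptions that are not taken from the text: the noise is absent and $V = I$, a hypothesis is accepted when its projection residual is exactly zero (instead of comparing against the constants $\gamma'$), hypotheses are tried in order of increasing $K$, and a fault $\phi_k$ in processor $k$ is assumed to produce the syndrome $s = -W_k \phi_k$. The example weight matrix is a hypothetical five-processor arrangement (one all-ones data column plus four checksum columns), chosen only because any four of its columns are linearly independent.

```python
from fractions import Fraction
from itertools import combinations


def solve(a, b):
    """Solve the square system a x = b exactly by Gauss-Jordan elimination."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def estimate_faults(w, s, faulty):
    """Least-squares fault sizes for the hypothesised processors (V = I).

    Minimises ||s + sum_i W_ki phi_ki||^2 and returns (phi_hat, residual energy).
    """
    cols = [[row[k] for row in w] for k in faulty]
    gram = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [-sum(x * y for x, y in zip(ci, s)) for ci in cols]
    phi = solve(gram, rhs)
    residual = [si + sum(c[i] * p for c, p in zip(cols, phi))
                for i, si in enumerate(s)]
    return phi, sum(r * r for r in residual)


def diagnose(w, s, k_max):
    """Return the smallest failure hypothesis that explains s exactly."""
    for k in range(k_max + 1):
        for faulty in combinations(range(len(w[0])), k):
            phi, energy = estimate_faults(w, s, faulty)
            if energy == 0:
                return faulty, phi
    return None


# Hypothetical weight matrix: column 0 is a data processor, columns 1-4 are
# the checksum processors (weight -1 each).
W = [[1, -1, 0, 0, 0],
     [1, 0, -1, 0, 0],
     [1, 0, 0, -1, 0],
     [1, 0, 0, 0, -1]]
```

The estimated fault sizes returned by `diagnose` would then be subtracted from the outputs of the processors named in the hypothesis. With numerical noise present, the exact zero test would have to be replaced by the threshold comparison described above.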


6.1.2  White  Noise  Case 

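When the numerical noises are white ($V = \sigma_v^2 I$), the computational effort
can be reduced by using the following calculation method.

$$\rho_{k,l} = \frac{1}{\sigma_v^2}\, s_k^T s_l \qquad \text{for } k = N+1, \ldots, N+C, \quad l = N+1, \ldots, N+C \tag{6.5}$$

$$L'_{k_1 \ldots k_K} = \mathrm{tr}\left\{ G_{k_1 \ldots k_K} \begin{bmatrix} \rho_{N+1,N+1} & \cdots & \rho_{N+1,N+C} \\ \vdots & & \vdots \\ \rho_{N+C,N+1} & \cdots & \rho_{N+C,N+C} \end{bmatrix} \right\} + \gamma'_{k_1 \ldots k_K} \tag{6.6}$$

where $G_{k_1 \ldots k_K}$ is a precomputed $C \times C$ matrix of constants defined as

$$G_{k_1 \ldots k_K} = \begin{bmatrix} w_{N+1,k_1} & \cdots & w_{N+1,k_K} \\ \vdots & & \vdots \\ w_{N+C,k_1} & \cdots & w_{N+C,k_K} \end{bmatrix} \begin{bmatrix} r_{k_1 k_1} & \cdots & r_{k_1 k_K} \\ \vdots & & \vdots \\ r_{k_K k_1} & \cdots & r_{k_K k_K} \end{bmatrix}^{-1} \begin{bmatrix} w_{N+1,k_1} & \cdots & w_{N+C,k_1} \\ \vdots & & \vdots \\ w_{N+1,k_K} & \cdots & w_{N+C,k_K} \end{bmatrix} \tag{6.7}$$

where $r_{kl}$ is the cross-correlation between the $k$th and $l$th weight vectors.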

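The estimation of the faulty processors' correct output also becomes
simpler.

$$\hat y_{k_i} = y_{k_i} - \hat\phi_{k_i} \tag{6.8}$$

where $\hat\phi_{k_i}$ can be efficiently computed as

$$\hat\phi_{k_i} = -\sum_{m=N+1}^{N+C} \bar w_{m,k_i}\, s_m \tag{6.9}$$

where the normalized weights $\bar w_{m,k_i}$ can be precomputed as

$$\begin{bmatrix} \bar w_{N+1,k_1} & \cdots & \bar w_{N+C,k_1} \\ \vdots & & \vdots \\ \bar w_{N+1,k_K} & \cdots & \bar w_{N+C,k_K} \end{bmatrix} = \begin{bmatrix} r_{k_1 k_1} & \cdots & r_{k_1 k_K} \\ \vdots & & \vdots \\ r_{k_K k_1} & \cdots & r_{k_K k_K} \end{bmatrix}^{-1} \begin{bmatrix} w_{N+1,k_1} & \cdots & w_{N+C,k_1} \\ \vdots & & \vdots \\ w_{N+1,k_K} & \cdots & w_{N+C,k_K} \end{bmatrix} \tag{6.10}$$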

If we assume that the $2K_m + 1$ modular redundancy technique is used
for doing the fault detection/correction from the syndromes, then the
computational overhead ratio $R_c$ for the $K_m$ correction is equal to

$$R_c = \frac{C}{N} + \frac{C(p+q)}{T} + (2K_m + 1)\left[ \frac{C(C+1)q}{2NT} + \frac{C(C+1)}{2NT} \sum_{k=1}^{K_m} \frac{(N+C)!}{(N+C-k)!\,k!} + \frac{K_m C q}{NT} \right] \tag{6.11}$$

The first term represents the computation in the $C$ checksum processors.
The second term represents the input checksum and syndrome calculations.
The third term is the computation involved in the fault detection/correction
algorithm replicated $2K_m + 1$ times. The first term inside the big bracket
in the third term represents the computation of $\rho_{kl}$. The second term in
the bracket represents the computation of the relative log likelihoods, and
the third term represents the fault correction. As $K_m$ increases, the number
of log likelihoods increases as $\sum_{k=1}^{K_m} (N+C)!/((N+C-k)!\,k!)$ and
the term associated with the computation of the log likelihoods grows as
$O((2K_m + 1)(N + C)^{K_m - 1})$. Therefore, this system may require more
overhead computation than a $2K_m + 1$ modular redundancy system, even for a
moderate $K_m$.
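
The growth in the number of failure hypotheses is easy to tabulate. The helper below is a small illustrative sketch (the function name is an assumption); it simply evaluates the sum of binomial coefficients quoted above.

```python
from math import comb


def num_failure_hypotheses(n_data, n_checksum, k_max):
    """Number of hypotheses with between 1 and k_max faulty processors."""
    total = n_data + n_checksum
    return sum(comb(total, k) for k in range(1, k_max + 1))
```

For example, ten data processors with four checksum processors give 14 single fault hypotheses and 91 double fault hypotheses, 105 in total, and the count rises to 5460 when the number of data processors is increased to one hundred.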


6.1.3  Reliability

For the system that can detect or correct up to $K_m$ faults, a system failure
occurs if more than $K_m$ processors fail. If we assume that all the processors
have the same failure rate $P_f$ and the processor failures are independent of
each other, then the system failure rate is equal to

$$\mathrm{Prob}(> K_m \text{ failures}) \approx \frac{(N+C)!}{(N+C-K_m-1)!\,(K_m+1)!}\, P_f^{K_m+1} \tag{6.12}$$

In comparison, the system failure rate of a system in which there are $2K_m + 1$
copies of each data processor used in modular redundancy form is equal to

$$\mathrm{Prob}(> K_m \text{ failures in any redundant group}) \approx N\, \frac{(2K_m+1)!}{K_m!\,(K_m+1)!}\, P_f^{K_m+1} \tag{6.13}$$
The failure rate of our system is approximately
$\frac{(N+C)!\,K_m!}{N\,(N+C-K_m-1)!\,(2K_m+1)!}$
times higher than the $2K_m + 1$ modular redundancy system, but uses $N + C$
processors rather than $N(2K_m + 1)$ processors, where the minimum $C$ is
equal to $2K_m$.
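
The comparison can be checked numerically. The sketch below assumes independent failures with a common failure probability, as in the text, and compares the leading-term approximations with the exact binomial tail; the function names and the factor $N$ for the number of redundant groups are assumptions of this sketch.

```python
from math import comb


def checksum_failure_prob(n, c, km, pf):
    """Leading term: km + 1 of the n + c processors fail."""
    return comb(n + c, km + 1) * pf ** (km + 1)


def modular_failure_prob(n, km, pf):
    """Leading term: one of n groups of 2*km + 1 copies loses km + 1 members."""
    return n * comb(2 * km + 1, km + 1) * pf ** (km + 1)


def exact_tail(n_proc, km, pf):
    """Exact probability that more than km of n_proc processors fail."""
    return sum(comb(n_proc, k) * pf ** k * (1 - pf) ** (n_proc - k)
               for k in range(km + 1, n_proc + 1))
```

With $N = 10$, $C = 4$, $K_m = 2$ and a failure probability of $10^{-4}$, the weighted checksum system is about 3.64 times more likely to fail than the 5-way modular redundancy system, while using 14 processors instead of 50.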

The  major  difficulty  in  implementing  a  Km  fault  detection  or  correction 
system  is  that  other  parts  of  the  system,  such  as  the  data  distribution 
network,  fault  detection/correction  hardware,  clocks,  controls,  and  power 
sources,  have  to  be  as  reliable  as  the  protected  processor  group.  If  the 
failure  rates  of  other  parts  are  comparable  to  the  single  processor  failure 
rate,  they  will  have  to  be  protected  by  a  2Km  +  1  modular  redundancy 
technique. Therefore, even though the number of the processors does not
increase dramatically with $K_m$, the size of other hardware parts grows as
$O(2K_m + 1)$.

The  misdiagnosis  probability  in  multiple  fault  correction  is  also  much 
higher  than  in  single  fault  correction  because  there  are  many  more  hy¬ 
potheses  compared  to  the  number  of  the  checksum  processors.  Therefore, 
the  angles  between  the  projections  would  be  smaller  than  in  the  single  fault 
correction  case,  and  it  is  more  likely  that  the  numerical  noise  may  push  the 
syndrome  closer  to  a  neighboring  hyperplane  of  another  failure  hypothesis. 

6.1.4  Weight  Vectors 

It is difficult to find suitable weights for the multiple fault
detection/correction systems. Jou and Abraham [Jou 86] used
$w_{k,m} = 2^{(k-N-1)(m-1)}$ as weights for the data processors. They have also
proven that with these weights, any combination of $C$ weight vectors is
linearly independent. These weights also have the advantage that
multiplication by a power of 2 can be done with simple bit-shifts. However,
the dynamic ranges of the checksum processor registers have to be much greater
than those of the data processors in order to accommodate
$w_{N+C,N} = 2^{(C-1)(N-1)}$. The weights vary greatly in magnitude, and the
numerical noises from the processors with large weights will heavily mask the
numerical noises from the processors with small weights in the syndromes.
Therefore, the system would be less sensitive to failures in the processors
with small weights.

Using low integers as weights is one possible way to get around the dynamic
range and the noise masking problems. We have found the following example
sets of weight vectors for double fault correction with $C = 4$. These
vectors are found by searching through all possible weight vectors with a
given range of weights and picking the weight vectors in such a way that
any set of $2K_m$ vectors is linearly independent. Using only 0 and ±1 as
weights, there can be only one data processor ($N = 1$) with the weight
matrix


W 


-1  0  0  O' 

0-100 
0  0-10 
0  0  0  -1 


(6.14) 


This is equivalent to the 5-way modular redundancy technique. Using 0,
±1, and ±2 as weights, we found four data processor weight vectors ($N = 4$)
with the weight matrix


W  = 


1 

-1 

2 

2 


1 

2 

■2 

1 


1  -1 
-2  0 

-1  0 

2  0 


0 

1 

0 

0 


0  0 

0  0 

-1  0 

0  -1 


(6.15) 


Using 0, ±1, ±2, and ±3 as weights, we found six data processor weight
vectors ($N = 6$) with the weight matrix


1112-100 
2  -2  -3  3  0  -1  0 

-1  -3  -2  -3  0  0  -1 

-3-1  3  1  000 


0  ■ 
0 
0 

-1 


(6.16) 


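Using 0, ±1, ±2, ±3 and ±4 as weights, we found nine data processor weight
vectors ($N = 9$) with the weight matrix

$$W = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 3 & 4 & 4 & -1 & 0 & 0 & 0 \\
1 & -1 & 2 & -2 & 3 & -3 & -2 & 3 & -3 & 0 & -1 & 0 & 0 \\
1 & 2 & -1 & -3 & -4 & -2 & -4 & -3 & 1 & 0 & 0 & -1 & 0 \\
1 & -2 & -3 & -1 & 2 & 3 & 1 & 1 & 2 & 0 & 0 & 0 & -1
\end{bmatrix} \tag{6.17}$$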
The number of data processor weight vectors $N$ was found to be dependent
on the order in which the vectors were picked. This indicates the existence
of many different sets of weight vectors with weights $0, \pm 1, \ldots, \pm L$,
with different numbers of weight vectors in different sets. Although using low
integers as weights spreads the weight vectors relatively far apart from each
other, the angles between the hyperplanes of different failure hypotheses $H_k$
are not necessarily large. If the angles between the hyperplanes are small,
the probability of misdiagnosis increases because the decision regions are
narrow.
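
The search described above can be sketched as a greedy procedure. This is an illustrative reconstruction, not the search program used to obtain the matrices in this section: the candidate ordering, the function names and the use of an integer determinant as the independence test are all assumptions. A candidate is accepted only if it is linearly independent of every group of $C - 1$ columns already chosen, which keeps every set of $C = 2K_m$ columns independent.

```python
from itertools import combinations, product


def det(m):
    """Integer determinant by cofactor expansion (adequate for small C)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def greedy_weight_search(c, max_weight):
    """Greedily collect data processor weight vectors with entries in
    -max_weight..max_weight such that any c columns are independent."""
    # Checksum processor columns: -1 in one position, 0 elsewhere.
    chosen = [tuple(-1 if i == j else 0 for i in range(c)) for j in range(c)]
    for cand in product(range(-max_weight, max_weight + 1), repeat=c):
        if all(det([list(col) for col in group] + [list(cand)]) != 0
               for group in combinations(chosen, c - 1)):
            chosen.append(cand)
    return chosen[c:]          # data processor weight vectors only
```

Run with `c = 4` and `max_weight = 1`, this search accepts a single data vector, in line with the observation that only one data processor can be protected with weights 0 and ±1. With larger weight ranges the number of vectors found depends on the order in which candidates are visited, as noted above.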

Another possible set of weights for $K_m$ fault correction is to use the
weights in the following form.

$$w_{k,m} = A^{(k-N-1)(m-1)} \qquad \text{where } A \neq 1 \tag{6.18}$$

Jou and Abraham have proven that any set of $C$ weight vectors is linearly
independent when $A = 2$. However, their proof is valid for some other
values of $A$ as well. When the computation is real, any value other than 1
can be used for $A$. When the computation is complex, $A$ can be of form

A  =  (6.19) 

For  example,  using  A  =  is  a  possibility. 


6.1.5  Practicality 

The multiple fault correction system may be impractical to implement
because of the reasons we have discussed so far, and may not have much
advantage over the $2K_m + 1$ modular redundancy system. It is difficult
to find suitable weight vectors. The amount of computation needed grows
as $O((2K_m + 1)(N + C)^{K_m - 1})$, and the overhead computation can easily
become more than in the $2K_m + 1$ modular redundancy method. Furthermore,
it is difficult to make the rest of the system hardware as reliable without
using the $2K_m + 1$ modular redundancy method.

It  is  also  difficult  to  develop  good  ad  hoc  methods  to  reduce  the  fault 
detection  computation  as  we  did  in  the  single  fault  correction  case.  How¬ 
ever,  when  the  computation  is  exact,  there  are  some  good  ad  hoc  methods, 
as  we  shall  discuss  later. 
6.2  Generalized Likelihood Ratio Test


If we assume that the numerical noise is Gaussian, we can use the
generalized likelihood ratio test that we have used for the single fault
correction case for multiple fault correction [Musicus 88]. For the $K_m$ fault
correction case, the processor faults are again modeled as non-zero mean
Gaussian random variables with unknown means $\phi_{k_1}, \ldots, \phi_{k_K}$ and
known covariances. The numerical noise from the fault-free processors is
modeled as zero mean Gaussian random variables with known variance. The log
likelihood for the failure hypothesis $H_{k_1 \ldots k_K}$ is defined
as


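$$L_{k_1 \ldots k_K} = \max_{\phi_{k_1}, \ldots, \phi_{k_K}} \log p(s, H_{k_1 \ldots k_K} \mid \phi_{k_1}, \ldots, \phi_{k_K}) \tag{6.20}$$

Let $\hat\phi_{k_1}, \ldots, \hat\phi_{k_K}$ be the values of
$\phi_{k_1}, \ldots, \phi_{k_K}$ that maximize the log likelihood. The
$\hat\phi_{k_1}, \ldots, \hat\phi_{k_K}$ can be thought of as the most likely
failure sizes that would have caused the syndrome $s$. Using Bayes's Rule and
substituting Gaussian densities into the log likelihoods, and then solving for
the maximum with respect to $\phi_{k_1}, \ldots, \phi_{k_K}$, we can show that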


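$$\begin{bmatrix} \hat\phi_{k_1} \\ \vdots \\ \hat\phi_{k_K} \end{bmatrix} = \arg\min_{\phi_{k_1}, \ldots, \phi_{k_K}} \Big\| s + \sum_{i=1}^{K} W_{k_i} \phi_{k_i} \Big\|^2_{V^{-1}} \tag{6.21}$$

$$\phantom{\begin{bmatrix} \hat\phi_{k_1} \end{bmatrix}} = -\left( \begin{bmatrix} W_{k_1}^T \\ \vdots \\ W_{k_K}^T \end{bmatrix} V^{-1} \begin{bmatrix} W_{k_1} & \cdots & W_{k_K} \end{bmatrix} \right)^{-1} \begin{bmatrix} W_{k_1}^T \\ \vdots \\ W_{k_K}^T \end{bmatrix} V^{-1} s \tag{6.22}$$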


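which is the same as the projection case. Substituting this into the log
likelihoods gives

$$L_0 = -\tfrac{1}{2} \| s \|^2_{V^{-1}} + \gamma_0, \qquad L_{k_1 \ldots k_K} = -\tfrac{1}{2} \Big\| s + \sum_{i=1}^{K} W_{k_i} \hat\phi_{k_i} \Big\|^2_{V^{-1}} + \gamma_{k_1 \ldots k_K} \tag{6.23}$$

If we define the relative log likelihood $L'_{k_1 \ldots k_K}$ as

$$L'_{k_1 \ldots k_K} = 2 (L_{k_1 \ldots k_K} - L_0) \tag{6.24}$$

then

$$L'_{k_1 \ldots k_K} = \Big\| \sum_{i=1}^{K} W_{k_i} \hat\phi_{k_i} \Big\|^2_{V^{-1}} + \gamma'_{k_1 \ldots k_K} \tag{6.25}$$

where $\gamma'_{k_1 \ldots k_K}$ is a constant. This is exactly the same result as
the projection method.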

If  the  syndromes  are  also  used  for  reducing  the  numerical  noise  as  well 
as  for  fault  detection/correction,  then  it  can  be  shown  that  the  estimates 
of  the  correct  processor  outputs  are 


+  EHi 


for  m  7^  ki, ...,  fc/f 


for  m  =  ki,  ...,kfc 

(6.26) 
Chapter 7

The  Exact  Arithmetic  Systems 


7.1  The  Integer  Arithmetic  Systems 

7.1.1  Single  Fault  Correction 

When integer arithmetic with no rounding is used in the fault tolerant
multiprocessor architecture, no numerical noise is introduced and all the
computations are exact. This makes the single fault detection/correction
procedure simple. If there is no fault, all the syndromes will be equal to zero
($s = 0$). If there is a fault in processor $k$, the syndromes will be
proportional to the weights of that processor, $s_m \propto w_{m,k}$, with the
fault size $\phi_k$ setting the common scale factor. A good fault
detection/correction method is to do the fault location on a point by point
basis (as if $q = 1$) using the slope $s_l/s_m$ between the syndromes. For
example, in a single fault correction system with two checksum processors
($C = 2$), the syndrome slope $s_{N+2}/s_{N+1}$ is equal to

$$\frac{s_{N+2}}{s_{N+1}} = \frac{w_{N+2,k}}{w_{N+1,k}} \tag{7.1}$$

If there are more than two syndromes, one has to compute the slopes between
$C - 1$ pairs of the syndromes.
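
The slope test can be written without any division by comparing cross products. The sketch below is illustrative only: the function name, the example weights and the syndrome convention (weighted sum of all processor outputs, with weight $-1$ on each checksum processor, so that a fault of size $e$ in processor $k$ adds $e\,w_{m,k}$ to $s_m$) are assumptions made for this example.

```python
def locate_single_fault(w, s):
    """Return (processor index, fault size) for an exact-arithmetic syndrome,
    or None when all syndromes are zero.

    w is the C x (N + C) weight matrix, s the list of C integer syndromes.
    """
    if all(v == 0 for v in s):
        return None
    c = len(s)
    for k in range(len(w[0])):
        col = [row[k] for row in w]
        # Equal slopes between every pair of syndromes means s is parallel
        # to the weight vector of processor k.
        if all(s[i] * col[j] == s[j] * col[i]
               for i in range(c) for j in range(i + 1, c)):
            m = next(i for i, weight in enumerate(col) if weight != 0)
            return k, s[m] // col[m]
    raise ValueError("syndrome matches no single fault hypothesis")
```

With the hypothetical weights `[[1, 1, 1, 1, -1, 0], [1, 2, 3, 4, 0, -1]]`, a fault of size 5 in the third data processor gives the syndromes (5, 15); the slope 3 identifies that processor, and subtracting 5 from its output restores the correct value.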

7.1.2  Multiple  Fault  Correction 

Previously in section 2.1.2, we mentioned that a single fault
detection/correction system using only 0 and 1 as weights is equivalent to
using error coding techniques in the multiprocessor environment. The checksum
processors in this case look very much like "parity" processors. Using this
parity processor technique, many existing single error correcting code
techniques based on parity checks, such as Hamming Code, can be directly
applied to the multiprocessor systems.

Multiple  error  correction  codes,  however,  cannot  be  as  readily  applied 
to  the  multiprocessor  system  as  can  the  single  error  correction  codes.  The 
reason  is  that  the  parity  bit  techniques  rely  on  the  fact  that  the  parity 
bit  is  formed  by  the  modulo  2  addition  of  the  protected  bits.  That  means 
that an odd number of faulty bits produces a parity mismatch and an even
number of faulty bits produces a parity match, which is not applicable
if the addition is not modulo 2.

The reason why it would be desirable to be able to apply the multiple
error correcting code to the multiprocessor system is that in order to
figure out which processors are faulty, the system has to figure out to
which multidimensional hyperplane of the different failure hypotheses the
syndrome belongs. For the $K_m$ fault correction system, there are
$\sum_{k=1}^{K_m} (N+C)!/((N+C-k)!\,k!)$ hyperplanes, each corresponding to one
possible failure hypothesis. Therefore, figuring out which hyperplane the
syndrome belongs to can take a lot of computation even for a moderate
$K_m$.

One possible way to apply a multiple error correction code to the
multiprocessor architecture is to apply it at the bit level. Consider the
following multiple error correction code used by a data transmission system
in which the bits $b_1, \ldots, b_N$ are protected by the "parity" bits
$b_{N+1}, b_{N+2}, \ldots, b_{N+C}$. The parity bits
$b_{N+1}, b_{N+2}, \ldots, b_{N+C}$ are calculated at the transmission end
to be

$$b_k = \Big( \sum_{m=1}^{N} w_{k,m} b_m \Big) \bmod 2 \qquad \text{for } k = N+1, \ldots, N+C \tag{7.2}$$

where the weights $w_{k,m}$ are 1's and 0's. At the receiving end, the fault
location is done with the parity check syndrome $s_k$.

$$s_k = \Big( b_k - \sum_{m=1}^{N} w_{k,m} b_m \Big) \bmod 2 \qquad \text{for } k = N+1, \ldots, N+C \tag{7.3}$$
We can apply the same multiple error correcting code to the multiprocessor
system at the bit level. The input checksums in this case are formed using
the same set of weights

$$x_k = \sum_{m=1}^{N} w_{k,m} x_m \qquad \text{for } k = N+1, \ldots, N+C \tag{7.4}$$

The fault detection/correction is done on a point by point basis (as if $q = 1$)
at the bit level. The least significant bits of the processor outputs are
corrected first. Let $b_{m,n}$ represent the $n$th least significant bit of the
processor output $y_m$. The least significant bits $b_{1,1}, \ldots, b_{N+C,1}$
are corrected using the multiple error correcting code. In order to do the
least significant bit correction, $C$ least significant bit syndromes are
formed.

$$s_k = b_{k,1} - \sum_{m=1}^{N} w_{k,m} b_{m,1} \qquad \text{for } k = N+1, \ldots, N+C \tag{7.5}$$

Let $s_{k,n}$ represent the $n$th least significant bit of $s_k$. The faulty bits
are then located with the code word
$(s_{N+C,1}, s_{N+C-1,1}, \ldots, s_{N+1,1})$.

Once the least significant bits are corrected, the second least significant
bits are corrected. To do this, the syndromes are calculated from the
second least significant bits and the correct least significant bits $b_{k,1}$.

$$s_k = (2 b_{k,2} + b_{k,1}) - \sum_{m=1}^{N} w_{k,m} (2 b_{m,2} + b_{m,1}) \qquad \text{for } k = N+1, \ldots, N+C \tag{7.6}$$

Then the fault corrections for $b_{m,2}$ are done with the code word
$(s_{N+C,2}, s_{N+C-1,2}, \ldots, s_{N+1,2})$. Since the least significant bits
$b_{m,1}$ are correct, an odd number of faulty second least significant bits
produces a mismatch ($s_{m,2} = 1$) and an even number of faulty second least
significant bits produces a match ($s_{m,2} = 0$), just like the error
correcting code case.
Then the third least significant bits are corrected and so on. For the $n$th
least significant bit correction $s_k$ is equal to

$$s_k = \sum_{l=1}^{n} 2^{l-1} b_{k,l} - \sum_{m=1}^{N} w_{k,m} \Big( \sum_{l=1}^{n} 2^{l-1} b_{m,l} \Big) \qquad \text{for } k = N+1, \ldots, N+C \tag{7.7}$$

and the fault correction is done with the code word
$(s_{N+C,n}, s_{N+C-1,n}, \ldots, s_{N+1,n})$.

Although using only 1's and 0's as weights requires more than the minimum
number of checksum processors, using the multiple error correcting
codes in bit by bit fashion for integer processors may be much simpler
than using weighted checksums, since figuring out which decision region the
syndrome belongs to is most likely a computationally intensive task. This
technique may be especially useful for bit-serial type machines.

7.1.3  Modulo  Arithmetic  in  Checksum  Processors 

In integer arithmetic systems, the weights are integers as well. Assuming
that there is no overflow or rounding in the processors, we have to use more
bits in the checksum processors than in the data processors to prevent
overflow in the checksum processors. In a single fault detection/correction
system, the checksum processor $m$ should have $\log_2 u_m$ more bits than the
data processors in order to prevent the overflow, where $u_m$ is defined as

$$u_m = \sum_{k=1}^{N} |w_{m,k}| \qquad \text{for } m = N+1, \ldots, N+C \tag{7.8}$$

We can reduce the number of extra bits needed in the checksum processors
by using modulo $M$ arithmetic in the checksum processors. The modulo
$M$ should be carefully chosen so that using the modulo arithmetic in place
of the integer arithmetic does not change the value of the syndrome $s_m$.
Suppose that the data processor outputs $y_k$ are always in the range
$|y_k| < R$. Then the syndrome $s_m$ of the single fault detection/correction
system would always be

$$|s_m| < \max_k(|w_{m,k}|)\, R \tag{7.9}$$
Therefore, if the modulo $M$ arithmetic is used in the checksum processor
$m$ where

$$M > \max_k(|w_{m,k}|)\, R \tag{7.10}$$

then the syndrome $s_m$ would be identical to the syndrome of an ordinary
integer arithmetic system. Note that modulo $M$ arithmetic can be used in
the input checksum calculation and the syndrome calculation as
well without changing the value of the syndrome $s_m$.

An easy form of modulo $M$ arithmetic to implement in hardware is when
$M$ is a power of 2. Therefore, a good choice of $M$ would be a power of two
that satisfies $M > \max_k(|w_{m,k}|) R$. This modulo arithmetic is especially
easy to implement if only the weights -1, 0, and +1 are used in a single fault
detection/correction system. Suppose that $R = 2^B$ where $B$ is the number
of bits used in the data processor registers. In this case, one can use modulo
$2^B$ arithmetic in the checksum processors, and all the processors would have
the same number of bits in their registers, simplifying the hardware design.

For the $K_m$ fault detection or correction systems, one has to use modulo
$M$ arithmetic in the checksum processor $m$ and the corresponding input
checksum and syndrome calculations, where $M/R$ is equal to or greater
than the sum of the $K_m$ largest $|w_{m,k}|$'s. This choice of $M$ again ensures
that the syndrome value is not changed. Using a power of 2 for $M$ is
convenient to implement in hardware.
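
The effect of wrap-around arithmetic on the syndrome can be demonstrated directly. This is a minimal sketch under stated assumptions: the syndrome is taken as the weighted sum of the data outputs minus the checksum output, the checksum processor stores only a residue modulo a power of two, and the final residue is mapped to the signed value of smallest magnitude (which recovers the integer syndrome whenever its magnitude is below half the modulus). The function names are illustrative.

```python
def centered_residue(value, modulus):
    """Map a residue to the representative in [-modulus/2, modulus/2)."""
    r = value % modulus
    return r - modulus if r >= modulus // 2 else r


def syndrome_mod(weights, data_outputs, checksum_residue, modulus):
    """Syndrome of one checksum processor computed entirely modulo `modulus`."""
    acc = 0
    for w, y in zip(weights, data_outputs):
        acc = (acc + w * y) % modulus      # accumulator never exceeds modulus
    return centered_residue(acc - checksum_residue, modulus)
```

In the 8-bit example used in the test, the true checksum 330 overflows and is stored as 74, yet the syndrome is still 0 without a fault and equals the fault size when one data output is corrupted.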


7.2  Residue  Arithmetic  Systems 

7.2.1  Residue  Number  System  Multiprocessors 

Our fault tolerance technique using weighted checksums can also be applied
to multiprocessor systems using residue number system (RNS) processors.
An RNS processor converts the input to $L$ residues of different modulos
$M_1, M_2, \ldots, M_L$ and processes each residue independently of the others.
All the residues are combined and converted back to integers at the output. The
modulos $M_1, M_2, \ldots, M_L$ have to be mutually prime. The RNS processor
potentially can run much faster than the conventional integer processors
because there is no carry chain delay involved between residues [Taylor 84].
An RNS processor can be viewed as an integer processor which performs
modulo $M$ arithmetic, where $M = \prod_{l=1}^{L} M_l$.

In a conventional RNS processor, fault tolerance is achieved by using
extra residues [Etzel 80]. For example, single error detection is achieved
by using one extra residue, which is of a modulo that is greater than the
other modulos and is mutually prime to them. The fault detection is done by
range detection. If there is a fault in one of the residues, the output $y$,
which is converted from the $L + 1$ residues, is not within the acceptable
range of $0 \le y < M$, where $M = \prod_{l=1}^{L} M_l$.
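
Range detection with one redundant residue can be illustrated with small moduli. The sketch below is a toy example: the moduli 3, 5, 7 with redundant modulus 11 and the function names are assumptions chosen for illustration, and the reconstruction uses the standard Chinese remainder formula.

```python
from math import prod


def to_residues(x, moduli):
    """Residue representation of x for the given moduli."""
    return [x % m for m in moduli]


def crt(residues, moduli):
    """Chinese remainder reconstruction for pairwise coprime moduli."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)   # modular inverse (Python 3.8+)
    return x % total


def fault_detected(residues, moduli, legal_range):
    """A value outside 0 <= y < legal_range signals a faulty residue."""
    return crt(residues, moduli) >= legal_range
```

With $M = 3 \cdot 5 \cdot 7 = 105$, the value 52 has residues (1, 2, 3, 8). Corrupting the residue modulo 5 from 2 to 4 reconstructs to 514, which lies outside the legal range and so exposes the fault.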

Single fault correction in a conventional RNS processor is done with two
extra residues, which are of modulos that are greater than the original $L$
modulos. All $(L + 2)$ modulos have to be mutually prime. At the output,
$L + 2$ outputs are formed, each from a different set of $(L + 1)$ modulos. When
there is no fault, all these outputs are identical. When there is a fault, only
one output would be within the acceptable range $0 \le y < M$, and that
is the correct output. It takes $K_m$ extra residues for $K_m$ fault detection,
and $2K_m$ extra residues for $K_m$ fault correction. Converting $L + 2$ sets of
$L + 1$ modulos to integers and doing the range test is a computationally
intensive task. Also, the $(L + 1)$th and $(L + 2)$th modulos that are used for
fault tolerance have to be bigger than the original $L$ modulos, increasing
the size of the hardware involved with those modulos.

For a high throughput system which uses multiple RNS processors in
parallel, the weighted checksum architecture can be used for fault tolerance
instead of using extra residues in each processor. Assuming that there is
no overflow (overflow over $M$ where $M = \prod_{l=1}^{L} M_l$) in the data
processors, the weighted checksum technique can be used in the same way as in
the ordinary integer processor systems. The only difference is that if the
checksum processor dynamic range needs to be larger than the data
processor dynamic range, then one has to use extra modulos in the checksum
processor in order to increase its dynamic range. If only -1, 0, and +1
are used for single fault detection/correction, the checksum processors can
be modulo $M$ arithmetic processors (i.e. checksum processors are
identical to the data processors) as discussed previously in the integer
processor systems. If weights other than -1, 0, and +1 are used for single
fault detection/correction, the checksum processor $m$ should use extra
residues (use $L'$ residues where $L' > L$) so that
$M' > \max_k(|w_{m,k}|) M$ where $M' = \prod_{l=1}^{L'} M_l$.
If $K_m$ fault detection/correction is desired, $M'/M$ should be equal to
or greater than the sum of the $K_m$ largest $|w_{m,k}|$'s.

The major advantage of our fault tolerant architecture is that fault
detection/correction does not require the multiple residue to integer
conversions, which are computationally costly.

7.2.2  Modulo  Arithmetic  Processor  Systems 

In  the  previous  section,  we  have  uspH  V  pmc 

~zrsx““rj;;xtr -T^ 

?hirreq'  T  f  inThedTta  proceTsorT 

rS~=“-““r~ 


$$(\phi_i w_i + \phi_j w_j) \bmod M_l \neq 0 \qquad \text{for } \begin{cases} \phi_i = 1, 2, \ldots, M_l - 1 \\ \phi_j = 1, 2, \ldots, M_l - 1 \\ i \neq j \end{cases} \tag{7.11}$$


itelrly' indeSn^’m^dd  be  pairwise 

""  for?he  doublefeuh 
Since $M_l$ is usually fairly small in RNS systems in order to speed up the
computations, there is a limited number of points in the syndrome space
$(s_{N+1}, s_{N+2}, \ldots, s_{N+C})$. The limited syndrome space limits the
number of data processors that can be protected using $C$ checksum processors.
Since each syndrome $s_k$ can assume $M_l$ different values
$(0, 1, \ldots, M_l - 1)$, there are only $M_l^C$ points in the syndrome space.
Out of $M_l^C$ points in the syndrome space, the no-fault hypothesis $H_0$ takes
up one point ($s = 0$). Each failure hypothesis takes up $M_l - 1$ points.

$$s = (w_k \phi_k) \bmod M_l \qquad \text{for } \phi_k = 1, 2, \ldots, M_l - 1 \tag{7.12}$$

Therefore, the maximum number of processors is equal to

$$\max(N + C) = \frac{M_l^C - 1}{M_l - 1} \tag{7.13}$$

However, this is achievable only if $M_l$ is prime. If $M_l$ is not prime but is
equal to

$$M_l = \prod_{n} P_n \tag{7.14}$$

where the $P_n$'s are primes, the weight vectors have to be linearly independent
in all the modulo $P_n$'s in order to satisfy the unique decision region
condition in equation 7.11. Suppose that two weight vectors $w_i$ and $w_j$ are
linearly dependent in modulo $P_n$.

$$(w_i + \alpha w_j) \bmod P_n = 0 \qquad \text{for some nonzero } \alpha \tag{7.15}$$

Then for the processor error values of $\phi_i = \prod_{m \neq n} P_m$ and
$\phi_j = \alpha \prod_{m \neq n} P_m$, we have

$$(\phi_i w_i + \phi_j w_j) \bmod M_l = 0 \tag{7.16}$$

Therefore, a processor error of size $\phi_i$ on processor $i$ yields exactly the
same syndromes as a failure of size $-\alpha\phi_i$ on processor $j$. One cannot
distinguish between these failures and cannot reliably correct the fault.
Therefore, in order to satisfy the unique decision region condition in
equation 7.11, the
weight vectors must be linearly independent in all the modulo $P_n$'s. This
means that the maximum number of weight vectors is determined by the
smallest $P_n$.

$$\max(N + C) = \frac{\min(P_n)^C - 1}{\min(P_n) - 1} \tag{7.17}$$


A simple way to construct a set of single fault correction weight vectors is to use a search method. Suppose $M_l$ is prime. Start with a search space of all possible weight vectors $(w_{N+1}, w_{N+2}, \ldots, w_{N+C})$ where each component $w_m = 0, 1, \ldots, M_l - 1$. First, eliminate the all-zero vector $\mathbf{0}$ from the search space since it is not a useful weight vector. Then eliminate the checksum processor weight vectors (each of which has a single non-zero entry equal to $-1$, as in Tables 7.2 and 7.3) and the vectors that are linearly dependent on them ($w_k \phi_k$ where $\phi_k = 0, 1, \ldots, M_l - 1$ and $k = N + 1, \ldots, N + C$) from the search space. Note that "$-1$" is equal to "$M_l - 1$" in modulo $M_l$ arithmetic. Then pick one of the remaining vectors as a weight vector $w_1$ and delete it and all the linearly dependent vectors ($w_1 \phi_1$ where $\phi_1 = 0, 1, \ldots, M_l - 1$) from the search space. Then pick one of the remaining vectors as another weight vector $w_2$ and delete it and all the linearly dependent vectors from the search space. Then another weight vector is picked from the remaining vectors, and so on. This process is repeated until the search space is empty.
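The search just described can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, the modulus is assumed prime, and the choice of always picking the lexicographically smallest remaining vector is an assumption made here so that the result is deterministic (it also happens to yield vectors whose leading non-zero coefficient is 1).

```python
from itertools import product


def scalar_multiples(vec, modulus):
    """All vectors phi * vec (mod modulus) for phi = 0 .. modulus - 1."""
    return {tuple((phi * c) % modulus for c in vec) for phi in range(modulus)}


def search_single_fault_weights(modulus, num_checksum):
    """Greedy search for single fault correction weight vectors.

    Assumes modulus is prime. Returns (checksum_vectors, data_vectors).
    """
    # Search space: every C-dimensional vector with entries 0 .. M - 1.
    space = set(product(range(modulus), repeat=num_checksum))
    space.discard((0,) * num_checksum)  # the all-zero vector is not useful

    # Checksum processor weight vectors: a single entry equal to -1 (= M - 1).
    checksum_vectors = []
    for position in range(num_checksum):
        vec = [0] * num_checksum
        vec[position] = modulus - 1
        checksum_vectors.append(tuple(vec))
        space -= scalar_multiples(vec, modulus)

    # Repeatedly pick a remaining vector and delete its scalar multiples.
    data_vectors = []
    while space:
        chosen = min(space)  # deterministic pick (an assumption of this sketch)
        data_vectors.append(chosen)
        space -= scalar_multiples(chosen, modulus)
    return checksum_vectors, data_vectors


if __name__ == "__main__":
    checksum, data = search_single_fault_weights(5, 2)
    print(checksum)
    print(data)
```

For modulo 5 with two checksum processors, the sketch returns four data processor weight vectors, so that the total of six vectors agrees with the bound in equation 7.13.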

The chosen data processor weight vectors can be put into a more convenient form by normalizing them so that the leading non-zero coefficient is equal to 1. Suppose a chosen weight vector $w_k$ has the leading non-zero coefficient $\alpha$:

$$w_k^T = (0, 0, \ldots, 0, \alpha, \ldots) \tag{7.18}$$

If $M_l$ is prime, then $\alpha^{-1}$ exists. Therefore, the normalized weight vector $\bar{w}_k = (\alpha^{-1} w_k) \bmod M_l$ is equal to

$$\bar{w}_k^T = (0, 0, \ldots, 0, 1, \ldots) \tag{7.19}$$
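The normalization step can be illustrated with Python's built-in modular inverse. The helper name `normalize_mod` is hypothetical, and the sketch assumes Python 3.8 or later, where `pow(a, -1, m)` returns the inverse of `a` modulo `m` when it exists.

```python
def normalize_mod(vec, modulus):
    """Scale vec so that its leading non-zero coefficient becomes 1 (mod modulus).

    The inverse of the leading coefficient exists whenever it is coprime to
    the modulus, which is always the case for a prime modulus.
    """
    for coefficient in vec:
        if coefficient % modulus != 0:
            inverse = pow(coefficient, -1, modulus)
            return tuple((inverse * x) % modulus for x in vec)
    raise ValueError("the all-zero vector cannot be normalized")


if __name__ == "__main__":
    print(normalize_mod((0, 3, 4), 5))
    print(normalize_mod((4, 3), 5))
```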

The normalized weights not only reduce the number of multiplies in the input checksum and syndrome calculations, but also provide a convenient method of detecting and correcting a fault. Suppose the syndromes $s^T = (s_{N+1}, s_{N+2}, \ldots, s_{N+C})$ have the leading non-zero coefficient $\beta$. Let $\bar{s} = \beta^{-1} s \bmod M_l$. Then if the failure hypothesis $H_k$ is true, we have $w_k = \bar{s}$ with $\phi_k = \beta$.

    H_k       w_k^T     s^T for phi_k = 1    phi_k = 2    phi_k = 3    phi_k = 4
    H_0                 (0, 0)
    H_{N+1}   (0, 1)    (0, 1)               (0, 2)       (0, 3)       (0, 4)
    H_{N+2}   (1, 0)    (1, 0)               (2, 0)       (3, 0)       (4, 0)
    H_1       (1, 1)    (1, 1)               (2, 2)       (3, 3)       (4, 4)
    H_2       (1, 2)    (1, 2)               (2, 4)       (3, 1)       (4, 3)
    H_3       (1, 3)    (1, 3)               (2, 1)       (3, 4)       (4, 2)
    H_4       (1, 4)    (1, 4)               (2, 3)       (3, 2)       (4, 1)

Table 7.1: Weight Vectors and Syndromes for Modulo 5 System (N = 4, C = 2)

Table 7.1 lists the normalized weight vectors and the possible syndrome values for each decision region for a single fault correction system with two checksum processors using modulo 5 arithmetic. The maximum number of processors is $N + C = 6$. Notice that the data processor weight vectors are in the convenient form $w_k^T = (1, k)$ and the leading non-zero coefficient of the syndrome is equal to the processor error.
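The decoding rule for normalized weights can be sketched as follows. This is an illustration under stated assumptions: the function name and the dictionary of weights are hypothetical, the weight vectors are those listed in Table 7.1, and the syndrome is assumed to equal the processor error times the weight vector, modulo 5.

```python
# Normalized weight vectors as listed in Table 7.1 (modulo 5, N = 4, C = 2).
TABLE_7_1_WEIGHTS = {
    "N+1": (0, 1),
    "N+2": (1, 0),
    1: (1, 1),
    2: (1, 2),
    3: (1, 3),
    4: (1, 4),
}


def locate_single_fault(syndrome, weights, modulus):
    """Return None when no fault is indicated, else (processor label, error).

    Normalizes the syndrome by its leading non-zero coefficient beta and
    matches the result against the normalized weight vectors; beta is then
    the processor error.
    """
    s = tuple(x % modulus for x in syndrome)
    beta = next((x for x in s if x != 0), None)
    if beta is None:
        return None  # hypothesis H_0: all syndromes are zero
    inverse = pow(beta, -1, modulus)
    s_bar = tuple((inverse * x) % modulus for x in s)
    for label, w in weights.items():
        if tuple(w) == s_bar:
            return label, beta
    raise ValueError("syndrome matches no single-fault hypothesis")


if __name__ == "__main__":
    print(locate_single_fault((3, 4), TABLE_7_1_WEIGHTS, 5))
    print(locate_single_fault((0, 0), TABLE_7_1_WEIGHTS, 5))
```

Only one modular inverse and one vector comparison per hypothesis are needed, which reflects why the normalized form is convenient.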

Another simple method of locating the faulty processor is to use a lookup table which has a failure hypothesis $H_k$ assigned to each possible syndrome. In conventional integer arithmetic, the syndrome space would be too large to use such a method. However, in modulo arithmetic, especially when $M_l$ and $C$ are relatively small, the syndrome space becomes a manageable size for such a lookup table method. $M_l$ is usually small in residue arithmetic in order to speed up the computation, and $C$ does not need to be larger than two or three in a single fault correction scheme, unless $N$ needs to be very large. If normalized weights are used, fault correction is very simple, since the leading non-zero syndrome is equal to the fault size.
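A lookup table of this kind can be built by enumerating every non-zero error on every processor. The sketch below is illustrative only: the function name is hypothetical, the weights are those of Table 7.1, and the syndrome is assumed to be the error times the weight vector modulo $M_l$.

```python
def build_syndrome_table(weights, modulus):
    """Map every single-fault syndrome to (processor label, error size).

    Raises ValueError if two different single faults share a syndrome,
    i.e. if the decision regions are not unique.
    """
    table = {}
    for label, w in weights.items():
        for phi in range(1, modulus):
            s = tuple((phi * c) % modulus for c in w)
            if s in table:
                raise ValueError(f"syndrome {s} is not unique")
            table[s] = (label, phi)
    return table


if __name__ == "__main__":
    weights = {"N+1": (0, 1), "N+2": (1, 0),
               1: (1, 1), 2: (1, 2), 3: (1, 3), 4: (1, 4)}
    table = build_syndrome_table(weights, 5)
    print(len(table))     # number of non-zero syndromes covered
    print(table[(2, 3)])  # faulty processor and error for syndrome (2, 3)
```

With modulo 5 and two checksum processors the table has one entry for each of the 24 non-zero syndromes, so every point of the syndrome space is assigned to exactly one hypothesis.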

The computational overhead associated with our fault detection and correction methods can be significantly lower than the computational overhead associated with the conventional fault tolerance methods of using extra residues. Our system also uses identical copies of the residue subprocessors as checksum processors. There is no need to design the subprocessors using different and larger residues.

For the $K_m$ fault correction system, we also have to choose the weight vectors so that different sets of up to $K_m$ processor failures do not generate the same syndromes:

$$\left( \sum_{k=1}^{I} \phi_{i_k} w_{i_k} + \sum_{k=1}^{J} \phi_{j_k} w_{j_k} \right) \bmod M_l \neq 0 \quad \text{for} \quad \begin{cases} 1 \le i_k \le N + C, & I \le K_m \\ 1 \le j_k \le N + C, & J \le K_m \end{cases} \tag{7.20}$$

where $(i_1, i_2, \ldots, i_I)$ is not the same set of integers as $(j_1, j_2, \ldots, j_J)$, and $\phi_{i_k}$ and $\phi_{j_k}$ are non-zero. This is achievable if any set of $2K_m$ weight vectors is linearly independent modulo $M_l$:

$$\left( \sum_{k=1}^{2K_m} \phi_{i_k} w_{i_k} \right) \bmod M_l \neq 0 \quad \text{for distinct } i_k \text{ and } \phi_{i_k} \text{ not all zero} \tag{7.21}$$

Notice that the above condition for the $K_m$ fault correction weight vectors is the same as the condition for the $2K_m$ fault detection weight vectors.

Since each failure decision region takes up $(M_l - 1)^m$ points in the syndrome space, where $m$ is the number of failed processors, the maximum number of processors in $K_m$ fault correction when $M_l$ is prime is determined by the following inequality:

$$M_l^C - 1 \ge \sum_{m=1}^{K_m} \frac{(N + C)!}{(N + C - m)!\, m!} (M_l - 1)^m \tag{7.22}$$
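The counting bound in equation 7.22 is easy to evaluate numerically. The following sketch, with a hypothetical function name, returns the largest $N + C$ that satisfies the inequality for a prime modulus; it assumes Python 3.8 or later for `math.comb`. It gives only the upper bound and says nothing about whether a set of weight vectors attaining it exists.

```python
from math import comb


def max_processors_bound(modulus, num_checksum, max_faults):
    """Largest N + C satisfying equation 7.22 for a prime modulus.

    The non-zero syndrome points (M^C - 1) must accommodate every pattern
    of 1 .. K_m faulty processors with non-zero error values.
    """
    capacity = modulus ** num_checksum - 1
    n = 0
    while True:
        needed = sum(comb(n + 1, m) * (modulus - 1) ** m
                     for m in range(1, max_faults + 1))
        if needed > capacity:
            return n
        n += 1


if __name__ == "__main__":
    print(max_processors_bound(5, 2, 1))  # single fault correction, modulo 5, C = 2
    print(max_processors_bound(3, 4, 2))  # double fault correction, modulo 3, C = 4
    print(max_processors_bound(3, 5, 2))  # double fault correction, modulo 3, C = 5
```

For $K_m = 1$ the result coincides with equation 7.13.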


When $M_l$ is not prime, but is equal to $M_l = \prod_{n} P_n$, then any set of $2K_m$ weight vectors has to be linearly independent modulo every $P_n$ in order to satisfy the unique decision region condition in equation 7.20. Suppose that two sets of $K_m$ weight vectors are linearly dependent modulo $P_n$:

$$\left( \sum_{k=1}^{K_m} \alpha_{i_k} w_{i_k} + \sum_{k=1}^{K_m} \alpha_{j_k} w_{j_k} \right) \bmod P_n = 0 \tag{7.23}$$

Then for the processor error values $\phi_{i_k} = \alpha_{i_k} \prod_{m \neq n} P_m$ and $\phi_{j_k} = \alpha_{j_k} \prod_{m \neq n} P_m$, we have

$$\left( \sum_{k=1}^{K_m} \phi_{i_k} w_{i_k} + \sum_{k=1}^{K_m} \phi_{j_k} w_{j_k} \right) \bmod M_l = 0 \tag{7.24}$$

Therefore, processor errors of sizes $\phi_{i_k}$ on processors $(i_1, i_2, \ldots, i_{K_m})$ yield exactly the same syndromes as failures of sizes $-\phi_{j_k}$ (modulo $M_l$) on processors $(j_1, j_2, \ldots, j_{K_m})$. One cannot distinguish between these failure modes and cannot reliably correct the faults. Therefore, in order to satisfy the unique decision region condition in equation 7.20, any set of $2K_m$ weight vectors must be linearly independent modulo $P_n$ for every $n$. This means that the maximum number of weight vectors is determined by the smallest $P_n$:

$$\min(P_n)^C - 1 \ge \sum_{m=1}^{K_m} \frac{(N + C)!}{(N + C - m)!\, m!} (\min(P_n) - 1)^m \tag{7.25}$$


A simple way to construct a set of weight vectors is to use a weight vector search method similar to the one used in the single fault correction case. Suppose we need weight vectors for $K_m$ fault correction when $M_l$ is prime. Start with a search space of all possible weight vectors $(w_{N+1}, w_{N+2}, \ldots, w_{N+C})$ where each component $w_m = 0, 1, \ldots, M_l - 1$. First, eliminate the all-zero vector $\mathbf{0}$ from the search space. Then eliminate the checksum processor weight vectors and all the linear combinations of up to $2K_m - 1$ checksum processor vectors from the search space (i.e. eliminate $\sum_{i=1}^{2K_m - 1} w_{k_i} \phi_{k_i}$, where $N + 1 \le k_i \le N + C$ and $\phi_{k_i} = 0, 1, \ldots, M_l - 1$). This guarantees that the linear combination of any $2K_m - 1$ checksum processor weight vectors will be linearly independent from the vectors remaining in the search space. Therefore, the next weight vector chosen will not violate the condition that any $2K_m$ weight vectors have to be linearly independent.

Then pick one remaining vector in the space as a weight vector $w_1$ and delete it and all the linear combinations of up to $2K_m - 1$ already chosen weight vectors from the search space (i.e. delete $\sum_{i=1}^{2K_m - 1} w_{k_i} \phi_{k_i}$ from the remaining set, where the $w_{k_i}$ are already chosen weight vectors and $\phi_{k_i} = 0, 1, \ldots, M_l - 1$). Then another weight vector is chosen. This process of choosing weight vectors is repeated until the search space is empty. One can end up with normalized weight vectors by picking each $w_k$ to be in a normalized form with the leading non-zero coefficient equal to one.

    M_l = 3         M_l = 5         M_l = 7         M_l = 11        M_l = 13        M_l = 17
    N + C = 5       N + C = 6       N + C = 8       N + C = 8       N + C = 9       N + C = 9
    (0, 0, 0, -1)   (0, 0, 0, -1)   (0, 0, 0, -1)   (0, 0, 0, -1)   (0, 0, 0, -1)   (0, 0, 0, -1)
    (0, 0, -1, 0)   (0, 0, -1, 0)   (0, 0, -1, 0)   (0, 0, -1, 0)   (0, 0, -1, 0)   (0, 0, -1, 0)
    (0, -1, 0, 0)   (0, -1, 0, 0)   (0, -1, 0, 0)   (0, -1, 0, 0)   (0, -1, 0, 0)   (0, -1, 0, 0)
    (-1, 0, 0, 0)   (-1, 0, 0, 0)   (-1, 0, 0, 0)   (-1, 0, 0, 0)   (-1, 0, 0, 0)   (-1, 0, 0, 0)
    (1, 1, 1, 1)    (1, 1, 1, 1)    (1, 1, 1, 1)    (1, 1, 1, 1)    (1, 1, 1, 1)    (1, 1, 1, 1)
                    (1, 2, 3, 4)    (1, 2, 3, 4)    (1, 2, 3, 4)    (1, 2, 3, 4)    (1, 2, 3, 4)
                                    (1, 5, 6, 2)    (1, 3, 2, 5)    (1, 3, 2, 5)    (1, 3, 2, 5)
                                    (1, 6, 5, 3)    (1, 4, 5, 9)    (1, 4, 5, 3)    (1, 4, 5, 9)
                                                                    (1, 7, 6, 2)    (1, 5, 4, 8)

Table 7.2: Weight Vectors for Double Fault Correction (C = 4)
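A candidate set of weight vectors can be checked against the condition that any $2K_m$ of them are linearly independent modulo a prime. The sketch below does this by brute force with Gaussian elimination modulo $p$; it is a verification aid rather than the search itself, the function names are hypothetical, and the exhaustive loop over subsets is only practical for small sets such as the columns of Table 7.2.

```python
from itertools import combinations


def rank_mod_p(rows, p):
    """Rank of a list of integer vectors over the integers modulo prime p."""
    m = [[x % p for x in row] for row in rows]
    rank = 0
    num_cols = len(m[0]) if m else 0
    for col in range(num_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inverse = pow(m[rank][col], -1, p)
        m[rank] = [(inverse * x) % p for x in m[rank]]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [(x - factor * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


def corrects_k_faults(weights, p, k):
    """True if every subset of 2k weight vectors is independent modulo p."""
    size = 2 * k
    return all(rank_mod_p(subset, p) == size
               for subset in combinations(weights, size))


if __name__ == "__main__":
    column_mod3 = [(0, 0, 0, -1), (0, 0, -1, 0), (0, -1, 0, 0),
                   (-1, 0, 0, 0), (1, 1, 1, 1)]
    column_mod5 = column_mod3 + [(1, 2, 3, 4)]
    print(corrects_k_faults(column_mod3, 3, 2))
    print(corrects_k_faults(column_mod5, 5, 2))
    # The same six vectors fail modulo 3, since (1, 2, 3, 4) has a zero entry there.
    print(corrects_k_faults(column_mod5, 3, 2))
```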

Note also that for single fault correction, the bound on $N + C$ is tight and can be achieved. For $K_m > 1$, however, the bound is not tight, and one may not find an appropriate set of vectors spanning the whole space.

Table  7.2  and  Table  7.3  show  the  double  fault  correction  weight  vectors 
for  C  =  4  and  C  =  5  found  by  the  above  search  method.  The  number  of 
data  processors  is  much  greater  for  C  =  5  than  for  C  =  4.  It  is  clear  from 
these  examples  that  it  is  likely  that  more  than  the  minimum  number  of 
checksum  processors  may  be  required  in  order  to  implement  double  fault 
correction. 




    M_l = 3             M_l = 5             M_l = 7
    N + C = 11          N + C = 11          N + C = 16
    (0, 0, 0, 0, -1)    (0, 0, 0, 0, -1)    (0, 0, 0, 0, -1)
    (0, 0, 0, -1, 0)    (0, 0, 0, -1, 0)    (0, 0, 0, -1, 0)
    (0, 0, -1, 0, 0)    (0, 0, -1, 0, 0)    (0, 0, -1, 0, 0)
    (0, -1, 0, 0, 0)    (0, -1, 0, 0, 0)    (0, -1, 0, 0, 0)
    (-1, 0, 0, 0, 0)    (-1, 0, 0, 0, 0)    (-1, 0, 0, 0, 0)
    (0, 1, 1, 1, 1)     (0, 1, 1, 1, 1)     (0, 1, 1, 1, 1)
    (1, 0, 1, 1, 2)     (0, 1, 2, 3, 4)     (0, 1, 2, 3, 4)
    (1, 1, 0, 2, 1)     (1, 0, 1, 1, 2)     (0, 1, 5, 6, 2)
    (1, 1, 2, 0, 2)     (1, 0, 2, 3, 3)     (0, 1, 6, 5, 3)
    (1, 2, 1, 2, 0)     (1, 1, 0, 1, 3)     (1, 0, 1, 1, 2)
    (1, 2, 2, 1, 1)     (1, 1, 1, 3, 0)     (1, 0, 2, 3, 1)
                                            (1, 0, 6, 5, 6)
                                            (1, 1, 0, 5, 1)
                                            (1, 1, 1, 4, 0)
                                            (1, 2, 0, 1, 4)
                                            (1, 2, 3, 2, 6)

Table 7.3: Weight Vectors for Double Fault Correction (C = 5)


Chapter  8 

Practical  Architectures 


In  this  chapter,  we  shall  discuss  possible  architectures  for  implementing  our 
single  fault  correction  multiprocessor  system.  We  limit  our  discussion  to  the 
datapath  of  the  system.  Of  course  the  clocks,  controls,  power  sources,  and 
other  parts  of  the  system  have  to  be  designed  reliably  (by  triple  modular 
redundancy  or  by  other  methods)  so  that  their  failure  rates  do  not  dominate 
the  system  failure  rate. 

In  our  proposed  architectures,  all  the  data  paths,  except  the  ones  that 
are  protected  by  our  fault  tolerant  algorithm,  are  triplicated  for  reliability. 
If  they  are  not  triplicated,  it  is  likely  that  their  failure  rate  would  dominate 
the  system  failure  rate.  Although  it  is  possible  to  design  a  reliable  data 
path  without  triplicating  through  error  coding  techniques,  it  is  not  clear 
that  one  can  make  such  data  paths  as  reliable  as  the  triplicated  ones.  For 
example, if the error coded bus system experiences a failure in one of its bus driver chips, the entire bus system may be stuck due to a single failure. In our architectures we have also assumed that the decoder, which is the hardware module that is responsible for fault detection and the faulty output correction from the syndromes, is triplicated for reliability as well.
The  entire  datapath  is  designed  so  that  any  single  component  failure  would 
not  cause  the  failure  of  the  entire  datapath. 
8.1  Single  Bus  Architecture

The single bus architecture is a simple and yet very efficient datapath architecture
for implementing our fault tolerant multiprocessor. Figure 8.1
shows  the  single  bus  datapath  architecture  for  the  single  fault  correction 
system.  There  are  N  data  processors  and  C  checksum  processors  in  the 
system.  It  is  designed  so  that  any  single  component  failure  would  not  cause 
a  system  failure.  The  main  bus  is  triplicated  for  reliability.  There  is  also 
a  bus  guardian  (BG)  unit  attached  to  each  processor.  The  bus  guardian 
unit is responsible for driving the bus during the output phase and isolating
the bus from the bus driving circuit during the non-output phase. The
bus  guardian  unit  is  designed  so  that  a  single  component  failure  would  not 
disable  more  than  one  of  three  main  busses.  A  local  bus  connects  the  bus 
guardian unit to the data processor. This local bus does not need to be triplicated
since it is protected by our fault tolerance algorithm. To the system,
a  failure  in  a  local  bus  would  be  equivalent  to  and  indistinguishable  from  a 
failure  in  the  corresponding  processor.  For  a  local  bus  that  is  connected  to 
the  checksum  processor,  there  is  also  a  checksum  calculator  attached  to  it. 
The  checksum  calculator  is  responsible  for  calculating  the  input  checksum 
and  the  syndrome,  and  has  two  sets  of  internal  accumulators:  one  for  the 
input  checksum  and  one  for  the  syndrome.  These  checksum  calculators  are 
not triplicated since they are also protected by our fault tolerance algorithm.
A failure in a checksum calculator would appear to the system no
differently  from  a  failure  in  the  corresponding  checksum  processor. 

Figure  8.3  shows  an  example  design  for  the  bus  guardian  unit  operated 
by  triplicated  control  signals.  During  the  input  phase,  the  information  from 
the  triplicated  main  bus  is  voted  on  before  being  passed  on  to  the  local  bus. 
During the output phase, the signals on the local bus drive all three main
busses.  The  bus  guardian  unit  is  designed  so  that  a  failure  in  one  of  its 
components  would  not  disable  more  than  one  of  the  three  main  busses. 

Figure 8.2 shows the timing diagram for the architecture. The system processes a batch of $N$ input data segments at a time. Each segment consists of $p$ input data points and $q$ output data points. As the input data comes in through the main bus, it gets loaded onto the data processors through the bus guardian unit and the local bus. The first input data segment $x_1$ is sent to the first data processor, the second data segment $x_2$ is sent to the second data processor, and so on. Each data processor starts processing the input segment as soon as it is received. Therefore, the data processors' computations will start in a staggered order. The checksum calculators, in the meantime, are in the process of calculating the input checksums. They take the input data $x_k$'s off the main bus, multiply them by the appropriate weights $w_{m,k}$, and sum the results into the input checksum accumulators. When the last input segment has been received and the checksum calculators have finished calculating the input checksums, the input checksums are then sent to the corresponding checksum processors through the local bus.

Figure 8.1: Single Bus Architecture

Figure 8.2: Single Bus Architecture Timing Diagram

Figure 8.3: Bus Guardian Unit (control signals: out_enable, in_enable)

As the data processors finish processing the data, they send the outputs $y_k$'s through the main bus to the decoder. While the outputs are being sent to the decoder, the checksum calculators compute the output checksums the same way they compute the input checksums. They take the output data $y_k$'s off the main bus, multiply them by the appropriate weights $w_{m,k}$, and sum the results in the syndrome accumulators. In a checksum calculator, there are two separate accumulators for the input checksum and the syndrome. When the checksum processors are finished processing the input checksums, they send the results to the checksum calculators through the local bus. The checksum calculators then compute the syndromes $s_m$'s by subtracting the output checksums held in their accumulators from the outputs of the checksum processors. The resulting syndromes are then sent to the triplicated decoder through the main bus. It takes $C$ data transfer cycles to finish this syndrome transfer process. From the data processor outputs and the syndromes, the decoder detects and corrects the faulty processor output and then outputs the result.
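The data flow of one batch can be imitated in software to see how the pieces fit together. The sketch below is a toy model under several assumptions that are not taken from the thesis: each segment is a single integer, the linear processing task is multiplication by a constant `gain` modulo a prime, the weights are the normalized data processor vectors of Table 7.1, only a data processor fault is injected, and the syndrome is formed as checksum processor output minus output checksum, following the wording above. All function names are hypothetical.

```python
def run_batch(inputs, data_weights, modulus, gain, fault=None):
    """Simulate one batch; returns (data processor outputs, syndromes).

    fault, if given, is (data processor index, additive error).
    """
    num_checksum = len(data_weights[0])

    def process(x):
        # Stand-in for the identical linear processors.
        return (gain * x) % modulus

    # Input phase: checksum calculators accumulate weighted input sums.
    input_checksums = [
        sum(w[m] * x for w, x in zip(data_weights, inputs)) % modulus
        for m in range(num_checksum)
    ]
    outputs = [process(x) for x in inputs]
    if fault is not None:
        index, error = fault
        outputs[index] = (outputs[index] + error) % modulus
    checksum_outputs = [process(c) for c in input_checksums]

    # Output phase: accumulate weighted output sums, then form the syndromes.
    output_checksums = [
        sum(w[m] * y for w, y in zip(data_weights, outputs)) % modulus
        for m in range(num_checksum)
    ]
    syndromes = [(co - oc) % modulus
                 for co, oc in zip(checksum_outputs, output_checksums)]
    return outputs, syndromes


def decode(outputs, syndromes, data_weights, modulus):
    """Correct a single faulty data processor output from the syndromes."""
    beta = next((s for s in syndromes if s != 0), None)
    corrected = list(outputs)
    if beta is None:
        return corrected  # no fault indicated
    inverse = pow(beta, -1, modulus)
    s_bar = tuple((inverse * s) % modulus for s in syndromes)
    for k, w in enumerate(data_weights):
        if tuple(w) == s_bar:
            # With this sign convention the syndrome is -error * w_k,
            # so adding beta removes the error.
            corrected[k] = (corrected[k] + beta) % modulus
            break
    return corrected


if __name__ == "__main__":
    weights = [(1, 1), (1, 2), (1, 3), (1, 4)]
    outputs, syndromes = run_batch([2, 4, 1, 3], weights, 5, gain=3, fault=(2, 4))
    print(outputs, syndromes)
    print(decode(outputs, syndromes, weights, 5))
```

Because the processors are linear, the checksum processor output equals the weighted sum of the fault-free data outputs, so any non-zero syndrome is due to a fault alone.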

This  is  a  very  heavily  pipelined  architecture.  While  a  batch  of  input 
data  is  going  into  the  data  processors,  the  processed  output  of  the  previous 
batch  is  being  sent  to  the  decoder,  and  the  fault  detected/corrected  results 
of  the  batch  before  that  are  being  outputted  from  the  decoder.  Notice 
that  the  timing  has  been  carefully  pipelined  so  that  the  idle  times  of  the 
processors are minimized. For example, the input and output phases of the data processor are interleaved so that as soon as a data processor has finished outputting the result through the main bus, it receives the next batch of input data, also through the main bus. The same goes for the checksum processors. All the checksum processors simultaneously output their results to the corresponding checksum calculators through the local bus. As soon as the outputting is completed, the new set of input checksums is immediately sent from the checksum calculators to the corresponding checksum processors through the local bus.

The  major  bottleneck  of  this  system  is  the  bandwidth  of  the  main  bus. 
The  main  bus  should  have  enough  bandwidth  to  take  care  of  all  the  data 
processor  inputs  and  outputs  as  well  as  the  syndrome  transfers.  If  the 
processor  computation  time  per  batch  is  shorter  than  the  time  needed  for 
data  transfers,  then  the  processors  would  sit  idle  for  at  least  part  of  the 
time. 


8.2  Unidirectional  Data  Flow  Architecture 

There  are  some  situations  when  it  is  advantageous  to  make  the  system  data 
flow  unidirectional.  For  example,  if  the  data  flow  is  unidirectional,  we  can 
use  multiplexers  instead  of  busses  with  bus  guardian  units  in  some  places. 
In  multiplexers,  it  is  easier  to  prevent  a  faulty  component  from  jamming 
the  entire  datapath.  Figure  8.4  shows  an  example  of  unidirectional  data 
flow  architecture.  In  the  input  stage,  a  triplicated  input  bus  is  used.  Notice 
that  simple  voters  (shown  as  “V”)  can  be  used  on  the  receiving  end  instead 
of  the  complex  bus  guardian  unit,  assuming  that  the  voter  is  designed  such 
that  a  faulty  voter  does  not  load  down  more  than  one  input  bus  at  a  time. 
The  outputs  of  the  processors  are  sent  to  the  triplicated  decoder  through 
a  triplicated  multiplexer  rather  than  through  a  bus  structure.  There  are 
also  separate  hardware  modules  for  input  checksum  calculation  and  the 
syndrome  calculation. 

The timing diagram for the unidirectional data flow architecture is in Figure 8.5. The input data is loaded onto the data processors the same way as in the single bus architecture. The first batch of the data $x_1$ goes to the first processor, the second batch $x_2$ to the second processor, and so on. The data processors start processing the input data as soon as they receive it. As the input data is being loaded onto the data processors, the input checksum calculators are in the process of calculating the input checksums. They take the input data off the bus, multiply it by the appropriate weights, and sum the results into their accumulators. As soon as the input checksums are calculated, they are sent to the corresponding checksum processors, which start processing them immediately. When the data processors are finished with processing the data, they send the results to the decoder one after another through the multiplexer. While the data processors are outputting
their  results,  the  syndrome  calculators  are  computing  output  checksums. 
After  all  the  data  processors  have  finished  outputting  the  results,  all  the 
checksum  processors  send  their  results  to  the  syndrome  calculators.  The 
syndrome  calculators  subtract  the  output  checksums  from  the  checksum 
processor  outputs  to  compute  the  syndromes.  The  resulting  syndromes  are 
then  sent  to  the  decoder  through  the  triplicated  multiplexer.  The  decoder 
then  locates  and  corrects  the  faulty  output  and  outputs  the  result. 


8.3  Variations 

In this architecture, it is also possible to vary the number of checksum processors used depending on the requirements of the application. For example, if the application does not require immediate correction of the faulty processor output, one can use the system in a single fault detection configuration instead of the single fault correction configuration. In that case, one can use all the processors except for one of the checksum processors as the data processors. The resulting single fault detection system has $(N + C - 1)$ data processors and one checksum processor, and thus has more processing power.

It is also possible to give a checksum/syndrome calculator to every processor. This way, any processor can be a checksum processor, which improves the reconfigurability. As we discussed in sections 3.2.3 and 3.3.3, if the checksum processor has the same dynamic range as the data processors in the non-exact arithmetic systems, a small fault in a checksum processor is more easily detected than in the data processors. With a checksum/syndrome calculator built into every processor, one can improve the detectability of the small faults in all the processors by periodically rotating different processors to be the checksum processors.

Having  to  design  the  checksum/syndrome  calculators  and  the  decoders 
as  different  hardware  modules  may  make  the  system  hardware  design  more 
complex  than  desired.  It  may  be  possible  in  some  cases  to  use  processors 
as the checksum/syndrome calculators and the decoders. This may be
especially  useful  if  the  processors  used  are  single  chip  processors. 

The input bus associated with the architecture in the previous section can be used for a variety of purposes other than just inputting the data. One can use it to load programs, do control, etc. In order to use the input bus for these multiple tasks, one may want to build additional control structures into the system such as interrupts, flags, etc. The input bus is also one of the major bottlenecks associated with the architectures in the previous sections. This bottleneck may be relieved by using multiple input busses. However, the design of such a scheme is complicated by the fact that the input data has to be sent to all the checksum processors as well as the corresponding data processor.

Figure 8.4: Unidirectional Data Flow Architecture

Figure 8.5: Unidirectional Data Flow Architecture Timing Diagram

It is also possible to have spare processors that would replace the faulty processors. This would be appropriate if, for example, the mean times till failure of the processors are much shorter than those of the rest of the system hardware. In this case, the system keeps track of the failure history of the processors, and if any processor is repeatedly diagnosed as being faulty, the system replaces it with a spare processor. In case there are no spare processors left, the system can reconfigure itself from the single fault correction configuration to the single fault detection configuration. However, the complexity associated with such a reconfigurable system may not justify the increase in performance.
Chapter  9
Conclusion 


In this thesis, we have proposed a fault-tolerant multiprocessor architecture which has much less redundant hardware associated with the fault tolerance than the Modular Redundancy techniques. The architecture uses weighted checksum techniques and is suited for linear digital signal processing applications that need to use multiple copies of identical processors in order to meet the throughput requirement.

The system consists of $N$ identical linear data processors and $C$ checksum processors. The inputs to the checksum processors are weighted sums of the data processor inputs. The fault detection/correction is done using the syndromes, which are the differences between the checksum processor outputs and the appropriately weighted sums of the data processor outputs. When there is no fault, all the syndromes are equal to zero. When there are faults, the syndromes are linear combinations of the weight vectors of the faulty processors. One needs a minimum of $L_m$ checksum processors for $L_m$ fault detection, and a minimum of $2K_m$ checksum processors for $K_m$ fault correction. For $L_m$ fault detection, any set of $L_m$ weight vectors should be linearly independent. For $K_m$ fault correction, any set of $2K_m$ weight vectors should be linearly independent. The low integer weights are good weights for single fault correction because multiplication by small integers requires less computational effort than a full multiply and also because they spread the weight vectors efficiently in the syndrome space. Good weights for multiple fault correction are difficult to find.

When fixed point or floating point arithmetic is used, the computation is no longer exact and the processor outputs and the syndromes contain numerical roundoff or truncation noise. In this case, the numerical noise is modeled as random variables and statistical fault detection/correction methods have been developed. For the single fault detection case, the fault is detected if the syndrome energy level exceeds the preset threshold. For the single fault correction case, the projection energy threshold method is derived for the fault diagnosis. This method projects the syndromes onto the weight vectors and performs a threshold test on the normalized projection energy. If the energy of only one projection exceeds the preset threshold, the corresponding processor is declared faulty. If more than one projection energy exceeds the threshold, the one that exceeds the threshold by the largest amount is declared faulty. The projection of the syndrome onto the faulty processor weight vector is used for the fault correction, since it represents the weighted value of the most likely processor error. When the numerical noise is white, the projection method can be simplified. An efficient algorithm for the fault detection/correction computation in this case was also derived.
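The projection energy threshold test summarized above can be sketched for the white noise case. This is a simplified illustration and not the algorithm derived in the thesis: it assumes scalar real-valued syndromes of the form error times weight vector plus noise, normalizes each projection only by the squared length of the weight vector, and uses a threshold value chosen arbitrarily for the example. The function name is hypothetical.

```python
def diagnose_by_projection(syndrome, weights, threshold):
    """Return None if no projection energy exceeds the threshold,
    else (index of the processor declared faulty, estimated error).

    The projection energy onto w_k is (s . w_k)^2 / (w_k . w_k), and the
    error estimate is the projection coefficient (s . w_k) / (w_k . w_k).
    """
    best = None
    for k, w in enumerate(weights):
        dot = sum(s * c for s, c in zip(syndrome, w))
        norm_sq = sum(c * c for c in w)
        energy = dot * dot / norm_sq
        if energy > threshold and (best is None or energy > best[1]):
            best = (k, energy, dot / norm_sq)
    if best is None:
        return None
    return best[0], best[2]


if __name__ == "__main__":
    # Small integer weights for four data processors and two checksum processors.
    weights = [(1, 1), (1, 2), (1, 3), (1, 4), (-1, 0), (0, -1)]
    # Error of about 0.5 on the second data processor plus small noise.
    print(diagnose_by_projection((0.51, 0.99), weights, threshold=0.1))
    # Noise only: no processor should be declared faulty.
    print(diagnose_by_projection((0.01, -0.02), weights, threshold=0.1))
```

Since the weight vectors are not orthogonal, several projection energies usually exceed the threshold at once, which is why the largest one is selected.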

Since  the  numerical  noise  is  modeled  as  a  statistical  process,  the  fault 
detection/correction  is  no  longer  exact.  These  fault  diagnosis  methods 
have  certain  probabilities  of  misdiagnosis.  However,  in  properly  designed 
systems,  such  misdiagnosis  does  not  have  fatal  consequences  to  the  system 
application.  The  net  effect  of  the  misdiagnosis  in  such  systems  is  nothing 
more  than  the  slight  increase  in  the  numerical  noise  level  of  some  of  the 
processor  outputs.  The  probability  of  misdiagnosis  decreases  when  the 
weight  vectors  are  spread  far  apart  in  the  syndrome  space. 

The projection fault detection/correction method is also equivalent to
the result of the generalized likelihood ratio test in which the numerical
noises  are  assumed  to  be  Gaussian  random  variables.  Using  the  generalized 
likelihood  ratio  test,  one  can  also  derive  a  method  for  using  the  syndromes 
not  only  to  detect  and  correct  the  faults,  but  also  to  filter  out  some  of  the 
numerical  noises  of  the  working  processors.  However,  the  potential  payoff 
and  the  practicality  of  such  noise  filtering  are  questionable. 

A  few  other  simple  ad  hoc  methods  for  single  fault  correction  in  the 
presence  of  the  numerical  noise  have  also  been  derived.  Some  of  them 
are  quite  simple  and  computationally  more  efficient  than  the  projection 
threshold  method,  although  the  probability  of  misdiagnosis  is  higher. 

The projection method and the generalized likelihood ratio test have
also been derived for the multiple fault correction in the presence of numerical
noise. However, such a multiple fault correction system may not be
very  practical  for  the  following  reasons.  First,  it  is  hard  to  find  good  weight 
vectors  for  multiple  fault  correction.  Second,  it  is  hard  to  design  the  rest 
of  the  system  hardware  to  be  as  reliable  as  the  protected  processor  group 
without  using  a  modular  redundancy  technique.  Third,  the  computational 
effort  required  for  detection/correction  of  the  faults  increases  dramatically 
as  the  number  of  faults  that  must  be  tolerated  increases,  and  the  system 
can  become  computationally  less  efficient  than  the  modular  redundancy 
systems  of  comparable  reliabilities. 

The  simulations  of  the  single  fault  correction  projection  method  have 
been  carried  out  using  an  example  fixed  point  system.  Properties  that  are 
difficult  to  compute  analytically,  such  as  false  alarm  rate  and  the  numerical 
noise in the corrected processor output, were simulated. In addition, the
dependence of the fault diagnosis method on variables such as the batch
size, the processor reliabilities, and the numerical noise levels was
simulated. We also ran a simulation to find the probability distribution of
the  syndrome  numerical  noise  under  the  assumption  that  all  the  roundoff 
operations are non-correlated. In order to find out whether the roundoff
operations in real systems are indeed non-correlated, a fixed point single fault
detection system was simulated with FIR filtering and the FFT as the processing
tasks. The roundoff operations in the FIR filtering were found to have little
correlation,  but  the  roundoff  operations  in  FFT  were  heavily  correlated, 
with  the  resulting  noise  level  much  lower  than  if  they  were  non-correlated. 

When  exact  integer  arithmetic  is  used,  one  can  achieve  reduction  in 
hardware  by  using  modulo  arithmetic  in  the  checksum  processors  and  in  the 
calculations of the input checksums and the syndromes. Moduli that are
powers of 2 are easy to implement. We have also developed a method of
applying the error coding technique to the multiple fault detection/correction
systems  using  integer  arithmetic.  This  eliminates  the  difficulty  of  finding 
good  weight  vectors  for  the  multiple  fault  detection/correction  systems, 
and  can  also  reduce  the  computational  effort  involved  with  multiple  fault 
detection/correction.  Our  fault  tolerance  method  can  also  be  used  in  the 
residue  number  system  processors  very  efficiently. 
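The following sketch illustrates the modulo-arithmetic checksum under stated assumptions: an integer FIR filter stands in for the linear operation, the modulus 2**8, the filter taps, the weights and the input data are all hypothetical, and the syndrome sign convention is assumed.

```python
def fir(x, h):
    """Exact integer FIR filter (full convolution)."""
    out = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out

def checksum_input(inputs, weights, mod):
    """Weighted sum of the processor inputs, reduced modulo mod."""
    return [sum(w * x[n] for w, x in zip(weights, inputs)) % mod
            for n in range(len(inputs[0]))]

def syndrome(outputs, check_output, weights, mod):
    """Checksum output minus weighted sum of processor outputs, modulo mod."""
    return [(c - sum(w * y[n] for w, y in zip(weights, outputs))) % mod
            for n, c in enumerate(check_output)]

MOD = 1 << 8                     # hypothetical modulus of the form 2**k
H = [3, -1, 2]                   # hypothetical filter taps
WEIGHTS = [1, 2, 3]              # hypothetical checksum weights
INPUTS = [[100, 200, 300], [7, -8, 9], [1000, 2000, -3000]]

outputs = [fir(x, H) for x in INPUTS]
# The checksum processor only ever handles values reduced modulo MOD.
check_output = [v % MOD for v in fir(checksum_input(INPUTS, WEIGHTS, MOD), H)]
clean = syndrome(outputs, check_output, WEIGHTS, MOD)

faulty = [list(y) for y in outputs]
faulty[1][2] += 3                # inject an error of 3 into processor 1
flagged = syndrome(faulty, check_output, WEIGHTS, MOD)
```

Because reduction modulo 2**k commutes with integer addition and multiplication, the syndromes vanish when no fault is present even though the working processors produce values far outside the checksum processor's range.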

Practical  architectures  for  the  single  fault  detection/correction  system 
are  presented.  Single  bus  architecture  and  unidirectional  data  flow  archi¬ 
tectures  are  discussed,  along  with  the  flexible  ways  they  can  be  used  to 
meet the various needs of the application.

The following are suggestions for further research on this topic. The
floating  point  numerical  noise  probability  distribution  and  the  effects  of 
the  floating  point  numerical  noise  in  fault  detection/correction  algorithms 
need  to  be  studied  in  more  detail.  It  is  also  possible  to  extend  the  fault  tol¬ 
erant  multiprocessor  architecture  to  non-linear  applications.  The  key  idea 
behind  the  weighted  checksum  architecture  is  that  the  weighted  checksum¬ 
ming  process  and  the  linear  operation  process  F  commute.  It  is  likely  that 
there  are  non-linear  processing  applications  in  which  the  desired  proces¬ 
sor  operation  and  another  operation  commute  in  a  similar  way  to  achieve 
fault  tolerance.  The  search  for  such  operation  pairs  and  the  development 
of  appropriate  fault  detection/correction  algorithms  are  desired. 
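For reference, the commutation property described above can be written out explicitly. The notation here ($x_k$ for processor inputs, $y_k$ for outputs, $w_{N+i,k}$ for the weights of checksum processor $N+i$, and the sign convention for the syndrome $s_i$) is assumed from the appendices and may differ in detail from the main text.

```latex
F\Bigl(\sum_{k=1}^{N} w_{N+i,k}\, x_k\Bigr) \;=\; \sum_{k=1}^{N} w_{N+i,k}\, F(x_k)
\quad\Longrightarrow\quad
s_i \;=\; y_{N+i} - \sum_{k=1}^{N} w_{N+i,k}\, y_k \;=\; 0
\quad \text{(no faults, exact arithmetic)}
```

A non-linear extension would need a pair of operations for which an analogous identity holds in place of linearity.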

Our  fault-tolerant  weighted  checksum  multiprocessor  architecture  not 
only achieves high reliability with low hardware overhead, but is also
applicable to any linear digital signal processing application. We have
developed very efficient fault detection/correction algorithms for both exact
arithmetic  systems  and  non-exact  arithmetic  floating  point  or  fixed  point 
systems.  Our  architectures  can  mask  any  single  point  failure  and  are  flexible 
to  meet  the  varying  degree  of  fault  tolerance  required  by  the  application. 
Appendix A

Proof  of  Section  2.2.4 


This appendix is taken from Musicus and Song's paper [Musicus 88]. Let
$\phi$ be the $(N+C)q$ length vector of the $N+C$ processor errors $\phi_k$.
If $K$ processors fail, then the corresponding error vector $\phi$ has $K$
non-zero block elements $\phi_k$. Let $\Omega_K$ be the set of all such possible
error vectors corresponding to up to $K$ processor failures. Let $s = -W\phi$
be the syndromes corresponding to the errors $\phi$. Then we require:

a) To reliably detect up to $K_m + L_m$ failures, the syndromes
must always be non-zero for any non-zero error vector in
$\Omega_{K_m+L_m}$:

$$-W\phi \neq 0 \quad \text{for all } \phi \in \Omega_{K_m+L_m},\ \phi \neq 0 \tag{A.1}$$

b) To reliably correct up to $K_m$ failures, the syndromes
corresponding to any two different such failures must be different:

$$-W\phi \neq -W\phi' \quad \text{for all } \phi, \phi' \in \Omega_{K_m},\ \phi \neq \phi' \tag{A.2}$$

c) To reliably distinguish a situation involving only $K_m$ or fewer
errors from one involving between $K_m$ and $K_m + L_m$ errors, so
that we can try to correct the former, we must be able to
distinguish the syndromes:

$$-W\phi \neq -W\phi' \quad \text{for all } \phi \in \Omega_{K_m},\ \phi' \in \Omega_{K_m+L_m},\ \phi \neq \phi' \tag{A.3}$$

Note that $\phi - \phi' \in \Omega_{K_1+K_2}$ if $\phi \in \Omega_{K_1}$ and
$\phi' \in \Omega_{K_2}$. Combining A.1, A.2, and A.3, we will therefore need:

$$-W\phi \neq 0 \quad \text{for all } \phi \in \Omega_{2K_m+L_m},\ \phi \neq 0 \tag{A.4}$$

This in turn implies that every set of $2K_m + L_m$ block columns of $W$ must
be linearly independent. But this implies that every set of $2K_m + L_m$ weight
vectors $(w_{N+1,k} \cdots w_{N+C,k})^T$ selected from $k = 1, \ldots, N+C$ must be
linearly independent. But this can only be possible if $C \geq 2K_m + L_m$.

Q.E.D. 
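The linear-independence condition at the end of this proof can be checked mechanically for a candidate set of weight vectors. The sketch below assumes scalar blocks ($q = 1$) and exact rational arithmetic; the function names and the example vectors are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def rank(rows):
    """Rank of a list of vectors, by exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in r] for r in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def tolerates(weight_vectors, k_correct, l_detect):
    """True if every set of 2*Km + Lm weight vectors is linearly independent."""
    need = 2 * k_correct + l_detect
    if need > len(weight_vectors[0]):      # violates C >= 2*Km + Lm
        return False
    return all(rank(list(sub)) == need
               for sub in combinations(weight_vectors, need))

# Hypothetical C = 2 system: four working processors plus two checksum columns.
vectors = [(1, 1), (1, 2), (1, 3), (1, 4), (-1, 0), (0, -1)]
print(tolerates(vectors, k_correct=1, l_detect=0))   # single fault correction
print(tolerates(vectors, k_correct=1, l_detect=1))   # would need C >= 3
```

The check is exhaustive over subsets, so it is only practical for small systems; it is meant as a design aid rather than a run-time test.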
Appendix B

Derivation  of  GLRT  (Proof  of 
Section  4.1) 


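To derive the formula for $L_0$, we apply Bayes' Rule and then substitute the
Gaussian formula:

$$\begin{aligned}
L_0 &= \log p(s, H_0) \\
    &= \log p(s \mid H_0) + \log p(H_0) \\
    &= -\tfrac{1}{2}\, s^T V^{-1} s - \tfrac{1}{2} \log|2\pi V| + \sum_{m=1}^{N+C} \log(1 - P_m)
\end{aligned} \tag{B.1}$$

To derive the formula for $L_k$ for $k = 1, \ldots, N+C$, we start by applying
Bayes' Rule to $L_k$, then substitute the Gaussian formula:

$$\begin{aligned}
L_k &= \max_{\bar\phi_k} \log p(s, H_k \mid \bar\phi_k) \\
    &= \max_{\bar\phi_k} \log p(s \mid H_k, \bar\phi_k) + \log p(H_k) \\
    &= \max_{\bar\phi_k} -\tfrac{1}{2} (s + W_k \bar\phi_k)^T (V - W_k \Delta\Phi_k W_k^T)^{-1} (s + W_k \bar\phi_k) \\
    &\qquad - \tfrac{1}{2} \log|2\pi (V - W_k \Delta\Phi_k W_k^T)| + \log P_k + \sum_{\substack{m=1 \\ m \neq k}}^{N+C} \log(1 - P_m)
\end{aligned} \tag{B.2}$$

where $\Delta\Phi_k$ is the change in the error covariance of processor $k$
under $H_k$. To maximize over $\bar\phi_k$, we apply the following lemma: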

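Lemma A

$$\max_{\bar\phi_k} -(s + W_k \bar\phi_k)^T (V - W_k \Delta\Phi_k W_k^T)^{-1} (s + W_k \bar\phi_k) \tag{B.3}$$

$$= \max_{\bar\phi_k} -(s + W_k \bar\phi_k)^T V^{-1} (s + W_k \bar\phi_k) \tag{B.4}$$

and the maximum for both expressions is achieved at the same value of
$\bar\phi_k$:

$$\hat\phi_k = -(W_k^T V^{-1} W_k)^{-1} W_k^T V^{-1} s \tag{B.5}$$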

Proof: Maximize the expressions above by differentiating with respect to
$\bar\phi_k$, then setting the derivatives to zero. We find that (B.4) is maximized
by (B.5), while (B.3) is maximized by:

$$\hat\phi_k = -\left( W_k^T (V - W_k \Delta\Phi_k W_k^T)^{-1} W_k \right)^{-1} W_k^T (V - W_k \Delta\Phi_k W_k^T)^{-1} s \tag{B.6}$$

To simplify this, we use a modified form of the Woodbury ABCD lemma:

$$(V - W_k \Delta\Phi_k W_k^T)^{-1} = V^{-1} + V^{-1} W_k \Delta\Phi_k (I - W_k^T V^{-1} W_k \Delta\Phi_k)^{-1} W_k^T V^{-1} \tag{B.7}$$

This is most easily proved by multiplying both sides by the inverse of the
left side and simplifying. (Note that it is true even if $\Delta\Phi_k$ is not invertible.)
Multiplying by $W_k^T$:

$$W_k^T (V - W_k \Delta\Phi_k W_k^T)^{-1} = (I - W_k^T V^{-1} W_k \Delta\Phi_k)^{-1} W_k^T V^{-1} \tag{B.8}$$

Substituting into (B.6) shows that (B.6) and (B.5) are identical. Now note
that:

$$W_k^T V^{-1} (s + W_k \hat\phi_k) = W_k^T V^{-1} \left( I - W_k (W_k^T V^{-1} W_k)^{-1} W_k^T V^{-1} \right) s = 0 \tag{B.9}$$

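Substituting into expression (B.3) and applying the Woodbury formula
again shows that:

$$\begin{aligned}
-(s + W_k \hat\phi_k)^T & (V - W_k \Delta\Phi_k W_k^T)^{-1} (s + W_k \hat\phi_k) \\
 &= -(s + W_k \hat\phi_k)^T V^{-1} (s + W_k \hat\phi_k) \\
 &= -s^T V^{-1} s + s^T V^{-1} W_k (W_k^T V^{-1} W_k)^{-1} W_k^T V^{-1} s
\end{aligned} \tag{B.10}$$

Applying this lemma to $L_k$ gives the formula in section 7.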
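Lemma A can be spot-checked numerically. The sketch below assumes two syndromes, a single scalar fault direction, and exact rational arithmetic; the particular values of the covariance, weight vector, covariance change and syndrome are arbitrary illustrative numbers, not data from the thesis.

```python
from fractions import Fraction as Fr

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(u, m, v):
    """u^T m v for 2-vectors u, v and a 2x2 matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

def estimate(w, m_inv, s):
    """-(w^T M^-1 w)^-1 w^T M^-1 s for a single weight column w."""
    return -quad(w, m_inv, s) / quad(w, m_inv, w)

V = [[Fr(3), Fr(1)], [Fr(1), Fr(2)]]   # hypothetical syndrome covariance
w = [Fr(1), Fr(2)]                     # hypothetical weight vector W_k
delta = Fr(1, 5)                       # hypothetical scalar covariance change
s = [Fr(4), Fr(-1)]                    # hypothetical observed syndromes

# Covariance under the fault hypothesis: V - W_k * delta * W_k^T.
M = [[V[i][j] - delta * w[i] * w[j] for j in range(2)] for i in range(2)]

phi_with_v = estimate(w, inv2(V), s)   # maximizer using V alone
phi_with_m = estimate(w, inv2(M), s)   # maximizer using the modified covariance

# Residual W_k^T V^-1 (s + W_k * phi), which the proof shows is zero.
residual = quad(w, inv2(V), [s[i] + w[i] * phi_with_v for i in range(2)])
```

Both estimates come out equal here, which is the practical content of the lemma: the estimator does not need to know how the failed processor's noise covariance changed.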

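To compute the relative likelihood $L'_k = 2(L_k - L_0)$, we exploit equation
(B.10). Performing the subtraction gives:

$$L'_k = s^T V^{-1} W_k (W_k^T V^{-1} W_k)^{-1} W_k^T V^{-1} s + \gamma'_k \tag{B.11}$$

where:

$$\gamma'_k = -\log|2\pi (V - W_k \Delta\Phi_k W_k^T)| + \log|2\pi V| + 2 \log\left( \frac{P_k}{1 - P_k} \right) \tag{B.12}$$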

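To simplify this formula, note that the following matrix can be factored in
two different ways:

$$\begin{aligned}
\begin{pmatrix} I & W_k^T \\ W_k \Delta\Phi_k & V \end{pmatrix}
&= \begin{pmatrix} I & 0 \\ W_k \Delta\Phi_k & I \end{pmatrix}
   \begin{pmatrix} I & 0 \\ 0 & V - W_k \Delta\Phi_k W_k^T \end{pmatrix}
   \begin{pmatrix} I & W_k^T \\ 0 & I \end{pmatrix} \\
&= \begin{pmatrix} I & W_k^T V^{-1} \\ 0 & I \end{pmatrix}
   \begin{pmatrix} I - W_k^T V^{-1} W_k \Delta\Phi_k & 0 \\ 0 & V \end{pmatrix}
   \begin{pmatrix} I & 0 \\ V^{-1} W_k \Delta\Phi_k & I \end{pmatrix}
\end{aligned} \tag{B.13}$$

Taking the determinants of the matrices in (B.13) and equating them gives:

$$|I| \, |V - W_k \Delta\Phi_k W_k^T| = |I - W_k^T V^{-1} W_k \Delta\Phi_k| \, |V| \tag{B.14}$$

Taking the log of both sides and substituting into (B.12) gives:

$$\gamma'_k = -\log|I - W_k^T V^{-1} W_k \Delta\Phi_k| + 2 \log\left( \frac{P_k}{1 - P_k} \right) \tag{B.15}$$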
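As a small worked check of the determinant identity (B.14), take a hypothetical case with two syndromes and a scalar covariance change; the numbers are chosen only for illustration.

```latex
V = \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}, \quad
W_k = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad
\Delta\Phi_k = \tfrac{1}{5}
\;\Longrightarrow\;
|V| = 5, \quad W_k^T V^{-1} W_k = 2,
\\[4pt]
|V - W_k \Delta\Phi_k W_k^T|
= \left| \begin{pmatrix} 2.8 & 0.6 \\ 0.6 & 1.2 \end{pmatrix} \right| = 3
= \bigl(1 - 2 \cdot \tfrac{1}{5}\bigr) \cdot 5
= |I - W_k^T V^{-1} W_k \Delta\Phi_k| \, |V|
```

The right-hand form needs only a $q \times q$ determinant rather than a $Cq \times Cq$ one, which is why (B.15) is the cheaper way to evaluate the constant.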
Appendix C

Derivation  of  Alternate  GLRT 
(Proof  of  Section  4.2) 


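Under hypothesis $H_0$, the processor outputs $y$ and syndromes $s$ can be
expressed in terms of $\bar y$ as follows:

$$\begin{pmatrix} y \\ s \end{pmatrix} = \begin{pmatrix} \bar y \\ 0 \end{pmatrix} + \begin{pmatrix} I & 0 \\ -W & I \end{pmatrix} \phi \tag{C.1}$$

where $W$ is a $Cq \times Nq$ block matrix with block columns $W_1, \ldots, W_N$. The
joint distribution of $y$ and $s$ is thus:

$$p\left( \begin{pmatrix} y \\ s \end{pmatrix} \,\Big|\, H_0, \bar y \right) = N\left( \begin{pmatrix} \bar y \\ 0 \end{pmatrix}, Q \right) \tag{C.2}$$

where:

$$Q = \begin{pmatrix} \Phi_y & -\Phi_y W^T \\ -W \Phi_y & V \end{pmatrix}
   = \begin{pmatrix} I & 0 \\ -W & I \end{pmatrix}
     \begin{pmatrix} \Phi_y & 0 \\ 0 & \Phi_s \end{pmatrix}
     \begin{pmatrix} I & -W^T \\ 0 & I \end{pmatrix} \tag{C.3}$$

$$V = W \Phi_y W^T + \Phi_s \tag{C.4}$$

and where $\Phi_y$ is a $Nq \times Nq$ block diagonal matrix with diagonal blocks
$\Phi_1, \ldots, \Phi_N$, and where $\Phi_s$ is a $Cq \times Cq$ block diagonal matrix with diagonal
blocks $\Phi_{N+1}, \ldots, \Phi_{N+C}$. Using Bayes' Rule and substituting this Gaussian
density into $L_0$:

$$\begin{aligned}
L_0 &= \max_{\bar y} \log p(y, s, H_0 \mid \bar y) \\
    &= \max_{\bar y} \log p(y, s \mid H_0, \bar y) + \log p(H_0) \\
    &= \max_{\bar y} -\tfrac{1}{2} \begin{pmatrix} y - \bar y \\ s \end{pmatrix}^T Q^{-1} \begin{pmatrix} y - \bar y \\ s \end{pmatrix} - \tfrac{1}{2} \log|2\pi Q| + \sum_{m=1}^{N+C} \log(1 - P_m)
\end{aligned} \tag{C.5}$$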

To  maximize  this  over  y,  we  will  use  the  following  lemma: 
Lemma  B 
Let: 


Then: 


E{a)  = 


mjnE{a)=fV^-^^0 

QL 

and  the  minimum  is  achieved  at: 

i  =  a  - 

Also: 


■  U.a 

Va/ 

f  ^  \ 

a  a  ' 

0 

.  V>a 

\ 

1  - 1 

^  / 

Proof:  We  can  factor  the  matrix  as  follows: 


(C.6) 


(cj; 


(C.8) 


(C.9) 


■  Vaa 

'1  KiVfg'] 

o 

_ 1 

-  v„eV;^'Vf„  0 
0  Vff0 


Thus  its  inverse  has  the  form: 


1  0 
0 

(C.IO) 
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■ 

-1 

V00  . 

I 


'pa 


Substituting  into  (C.6)  gives: 


VaeVfs'Vs,)-' 

0 


0  I 
(C.ll) 


Minimizing  over  a  and  plugging  back  into  {C.12)  completes  the  proof. 

Q.E.D. 

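Applying this lemma to (C.5), the maximum over $\bar y$ is achieved at:

$$\hat y = y + \Phi_y W^T V^{-1} s \tag{C.13}$$

and the likelihood equals:

$$L_0 = -\tfrac{1}{2}\, s^T V^{-1} s + \gamma_0 \tag{C.14}$$

where:

$$\begin{aligned}
\gamma_0 &= -\tfrac{1}{2} \log|2\pi Q| + \sum_{m=1}^{N+C} \log(1 - P_m) \\
         &= -\tfrac{1}{2} \log\left| 2\pi \begin{pmatrix} \Phi_y & 0 \\ 0 & \Phi_s \end{pmatrix} \right| + \sum_{m=1}^{N+C} \log(1 - P_m) \\
         &= -\tfrac{1}{2} \sum_{m=1}^{N+C} \log|2\pi \Phi_m| + \sum_{m=1}^{N+C} \log(1 - P_m)
\end{aligned} \tag{C.15}$$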


For  k  =  I,  ...,N  +  C,  hypothesis  Hk  models  the  processor  noise  as  a 
non-zero  mean  Gaussian  random  variable  with  density: 


=  (C.16) 

where  is  unknown.  The  other  processor  errors  are  the  same  as  under 
Hq.  The  joint  distribution  of  y  and  s  given  Hk  and  is  then: 
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p 


f  =  A' 


(C.17) 


where: 


[I  ol 

0 

1 

h-4 

1 

0  $5 

[0  I  J 

(C.18) 


For  k  =  Ijt  is  a  Nq  x  q  matrix  which  is  the  block  column  of 

an  Nq  x  Nq  identity  matrix,  is  just  with  the  /c'*  diagonal  block 
replaced  by  and  $s  equals  $s-  For  k  =  N  1,...,N  +  C,lk  =  0,^k 
equals  and  is  just  with  the  k*^  diagonal  block  replaced  by 
Thus  applying  Bayes’  Rule  and  substituting  the  formula  for  a  Gaussian: 


Lk 


m^  rnax  log  p{y,  s,  Hk\y,  <t>.) 

y  6  ~  ~ 

max  rnax  log  p{y,  s\Hk,y4>.) 

^  t  ■ 


mMmax-i(/  -  s’"  - 

y  2  -  -  -* 

-^logl27rQ(*)l  +  ^og  p{Hk) 


f"  y  -  y  - 

(C.19) 


Applying  lemma  B  to  maximize  over  y  gives: 


y  =  y  -  M,  +  $kW^(V  -  WfcA$*Wl’)-*(s  +  (C.20) 

Substituting  back  into  (0.19)  gives: 

Lk  =  max-i(s  +  W*^,)^(V- + 

t  ^ 

t  N-\-C 

-- log|27rQ(''l|  +  logP*  -  ^  log(l  -  P„i)  (0.21) 

m  =  1 
m  ^  k 
But applying lemma A, the maximum occurs at exactly the same value of
$\hat\phi_k$ as for our previous algorithm, (B.5). The likelihood at the maximum
equals:


_  i  1  — 

Lt  =  -iu  +  W^jfV-'U  +  Wtij--  E  Iog2ir4’™-2log2jr** 

^  m  =  1 

m  ^  k 

N+C  ,  . 

+  logP*+  L  •ogCl-Z’m)  (^-22) 

m  =  1 
m  ^  k 

Substituting (B.5) into the formula for $\hat y$ and simplifying using lemma A
again:

V  =y-Ifcy.  +  5KW^V-‘U+ Wfc0j  (C.23) 

or: 


f  y^  +  $mWj;;V-i(s  +  ^  for  m  ^  fc  ^^  24) 

1  Vk  -  ik  +  ^fcWrV-*(s  +  for  m  =  A: 


Using  (B.9),  the  case  m  =  k  simplifies: 


ym  = 


(C.25) 


+  WtJJ  form^fc 

y*  -  lit  for  m  =  fc 

Now  we  compute  the  relative  likelihoods  as  before.  The  quadratic  terms 
in  s  are  exactly  the  same  as  in  the  previous  method,  but  the  constants  are 
different: 


$$\begin{aligned}
L'_k &= 2(L_k - L_0) \\
     &= s^T V^{-1} W_k (W_k^T V^{-1} W_k)^{-1} W_k^T V^{-1} s + \gamma'_k
\end{aligned} \tag{C.26}$$


where: 
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I 

Ik  = 


-  log  |27r$t|  +  2  log  Pk  +  log  \2'n<f>^\  -  2  log(l  -  P*) 
-log|?.*i'!+2log(^^)  (C,27) 
Appendix D

Mean  and  Variance  of 
Likelihoods 


To  compute  the  mean  and  variance  of  the  relative  likelihoods  we  need 
the  following  lemma: 

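Lemma C: Suppose $\alpha$, $\beta$ are Gaussian random variables with means $\bar\alpha$, $\bar\beta$
and cross-covariance $V_{\beta\alpha} = \mathrm{Cov}(\beta, \alpha)$. Then for any matrix $A$:

$$E[\alpha^T A \beta] = \mathrm{tr}\{ A\, E[\beta \alpha^T] \} = \mathrm{tr}\{ A [\bar\beta \bar\alpha^T + V_{\beta\alpha}] \} \tag{D.1}$$

Proof:

$$\begin{aligned}
E[\alpha^T A \beta] &= E\Bigl[ \sum_{i,j} \alpha_i A_{ij} \beta_j \Bigr] \\
                    &= \mathrm{tr}\{ A\, E[\beta \alpha^T] \} \\
                    &= \mathrm{tr}\{ A [\bar\beta \bar\alpha^T + V_{\beta\alpha}] \}
\end{aligned} \tag{D.2}$$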

Q.E.D. 

With this lemma, we can compute the expected value of $L'_m$ under each
hypothesis $H_k$. Start with the following formula for $L'_m$:

$$L'_m = s^T V^{-1} W_m (W_m^T V^{-1} W_m)^{-1} W_m^T V^{-1} s + \gamma'_m \tag{D.3}$$

Let $\bar\phi_k$ be the mean of the failure of the $k$-th processor or syndrome. Using
lemma C, plus the fact that $\mathrm{tr}\{AB\} = \mathrm{tr}\{BA\}$:

$$\begin{aligned}
E[L'_m \mid H_0] &= \mathrm{tr}\{ V^{-1} W_m (W_m^T V^{-1} W_m)^{-1} W_m^T V^{-1} (V) \} + \gamma'_m \\
                 &= \mathrm{tr}\{ (W_m^T V^{-1} W_m)(W_m^T V^{-1} W_m)^{-1} \} + \gamma'_m \\
                 &= \mathrm{tr}\{ I \} + \gamma'_m \\
                 &= q + \gamma'_m
\end{aligned} \tag{D.4}$$
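The result $E[L'_m \mid H_0] = q + \gamma'_m$ can be checked by simulation. The sketch below assumes scalar blocks ($q = 1$), white syndrome noise with $V = \sigma^2 I$, and a hypothetical weight vector; it estimates only the mean of the quadratic term, which should then be close to 1.

```python
import random

def quadratic_term(s, w, sigma2):
    """s^T V^-1 w (w^T V^-1 w)^-1 w^T V^-1 s for V = sigma2 * I and q = 1."""
    ws = sum(a * b for a, b in zip(w, s))
    ww = sum(a * a for a in w)
    return ws * ws / (ww * sigma2)

def mean_quadratic_term(w, sigma2, trials, seed=0):
    """Monte Carlo mean of the quadratic term when no processor has failed."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    total = 0.0
    for _ in range(trials):
        s = [rng.gauss(0.0, sd) for _ in w]   # zero-mean syndromes under H_0
        total += quadratic_term(s, w, sigma2)
    return total / trials

# Hypothetical weight vector for a system with two checksum processors.
estimate = mean_quadratic_term([1.0, 3.0], sigma2=0.25, trials=20000)
```

The mean does not depend on the weight vector or on the noise level, which is what makes a fixed detection threshold on the relative likelihood meaningful.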

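For $k = 1, \ldots, N+C$ we use the fact that:

$$\begin{aligned}
E[s \mid H_k] &= -W_k \bar\phi_k \\
\mathrm{Var}[s \mid H_k] &= V - W_k \Delta\Phi_k W_k^T
\end{aligned} \tag{D.5}$$

Then using lemma C again:

$$\begin{aligned}
E[L'_m \mid H_k]
 &= \mathrm{tr}\{ V^{-1} W_m (W_m^T V^{-1} W_m)^{-1} W_m^T V^{-1} [ W_k \bar\phi_k \bar\phi_k^T W_k^T + V - W_k \Delta\Phi_k W_k^T ] \} + \gamma'_m \\
 &= \mathrm{tr}\{ W_k^T V^{-1} W_m (W_m^T V^{-1} W_m)^{-1} W_m^T V^{-1} W_k [ \bar\phi_k \bar\phi_k^T - \Delta\Phi_k ] + I \} + \gamma'_m \\
 &= \mathrm{tr}\{ R_{km} R_{mm}^{-1} R_{mk} [ \bar\phi_k \bar\phi_k^T - \Delta\Phi_k ] \} + q + \gamma'_m
\end{aligned} \tag{D.6}$$

where:

$$R_{km} = W_k^T V^{-1} W_m \tag{D.7}$$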

To  compute  the  variance,  we  need  the  following  lemmas: 

Lemma D: Suppose $\alpha$, $\beta$, $\gamma$, $\lambda$ are zero-mean, jointly Gaussian random
variables. Then:

$$\begin{aligned}
E[\alpha\beta\gamma\lambda] &= E[\alpha\beta] E[\gamma\lambda] + E[\alpha\gamma] E[\beta\lambda] + E[\alpha\lambda] E[\beta\gamma] \\
E[\alpha\beta\gamma] &= 0
\end{aligned} \tag{D.8}$$

Proof: A standard result (see [Helstrom 84] p. 210).

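Lemma E: Suppose $\alpha$, $\beta$, $\gamma$, $\lambda$ are jointly Gaussian random variables with
respective means $\bar\alpha$, $\bar\beta$, $\bar\gamma$, $\bar\lambda$. Then:

$$E[\alpha\beta\gamma\lambda] = E[\alpha\beta] E[\gamma\lambda] + E[\alpha\gamma] E[\beta\lambda] + E[\alpha\lambda] E[\beta\gamma] - 2 \bar\alpha \bar\beta \bar\gamma \bar\lambda \tag{D.9}$$

Proof: Let $\tilde\alpha = \alpha - \bar\alpha$; define $\tilde\beta$, $\tilde\gamma$, $\tilde\lambda$ similarly. Then $\tilde\alpha$, $\tilde\beta$, $\tilde\gamma$, $\tilde\lambda$ are zero
mean Gaussian random variables:

$$E[\alpha\beta\gamma\lambda] = E[(\tilde\alpha + \bar\alpha)(\tilde\beta + \bar\beta)(\tilde\gamma + \bar\gamma)(\tilde\lambda + \bar\lambda)] \tag{D.10}$$

Multiplying this out gives $2^4 = 16$ terms. The expectation of the product of
an odd number of zero-mean random variables is zero. Therefore, retaining
only the terms with an even number of random variables gives:

$$\begin{aligned}
E[\alpha\beta\gamma\lambda]
 &= E[\tilde\alpha\tilde\beta\tilde\gamma\tilde\lambda] + \bar\alpha\bar\beta E[\tilde\gamma\tilde\lambda] + \bar\alpha\bar\gamma E[\tilde\beta\tilde\lambda] + \bar\alpha\bar\lambda E[\tilde\beta\tilde\gamma] \\
 &\quad + \bar\beta\bar\gamma E[\tilde\alpha\tilde\lambda] + \bar\beta\bar\lambda E[\tilde\alpha\tilde\gamma] + \bar\gamma\bar\lambda E[\tilde\alpha\tilde\beta] + \bar\alpha\bar\beta\bar\gamma\bar\lambda \\
 &= E[\tilde\alpha\tilde\beta] E[\tilde\gamma\tilde\lambda] + E[\tilde\alpha\tilde\gamma] E[\tilde\beta\tilde\lambda] + E[\tilde\alpha\tilde\lambda] E[\tilde\beta\tilde\gamma] \\
 &\quad + \bar\alpha\bar\beta E[\tilde\gamma\tilde\lambda] + \bar\alpha\bar\gamma E[\tilde\beta\tilde\lambda] + \bar\alpha\bar\lambda E[\tilde\beta\tilde\gamma] \\
 &\quad + \bar\beta\bar\gamma E[\tilde\alpha\tilde\lambda] + \bar\beta\bar\lambda E[\tilde\alpha\tilde\gamma] + \bar\gamma\bar\lambda E[\tilde\alpha\tilde\beta] + \bar\alpha\bar\beta\bar\gamma\bar\lambda
\end{aligned} \tag{D.11}$$

Factoring, and using the fact that:

$$E[\alpha\beta] = E[\tilde\alpha\tilde\beta] + \bar\alpha\bar\beta \tag{D.12}$$

proves the lemma.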


Q.E.D. 
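As a quick sanity check on Lemma E, consider the special case (chosen here for illustration, not taken from the original) in which all four variables are the same scalar $x \sim N(\mu, \sigma^2)$. The lemma must then reproduce the known fourth moment of a Gaussian:

```latex
E[x^4] = 3\,\bigl(E[x^2]\bigr)^2 - 2\mu^4
       = 3(\mu^2 + \sigma^2)^2 - 2\mu^4
       = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4
```

Setting $\mu = 0$ recovers the zero-mean result of Lemma D, $E[x^4] = 3\sigma^4$.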

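Lemma F: Suppose $\alpha$, $\beta$, $\gamma$, and $\lambda$ are jointly Gaussian random variables
with respective means $\bar\alpha$, $\bar\beta$, $\bar\gamma$, and $\bar\lambda$. Let $A$, $G$ be arbitrary matrices. Then:

$$\begin{aligned}
E[\alpha^T A \beta \, \gamma^T G \lambda] &= E[\alpha^T A \beta] E[\gamma^T G \lambda] + \mathrm{tr}\{ A E[\beta\gamma^T] G E[\lambda\alpha^T] \} \\
 &\quad + \mathrm{tr}\{ A^T E[\alpha\gamma^T] G E[\lambda\beta^T] \} - 2 \bar\alpha^T A \bar\beta \, \bar\gamma^T G \bar\lambda
\end{aligned} \tag{D.13}$$

Proof:

$$\begin{aligned}
E[\alpha^T A \beta \, \gamma^T G \lambda]
 &= \sum_{i,j,k,l} A_{ij} G_{kl} E[\alpha_i \beta_j \gamma_k \lambda_l] \\
 &= \sum_{i,j,k,l} A_{ij} G_{kl} \{ E[\alpha_i \beta_j] E[\gamma_k \lambda_l] + E[\alpha_i \gamma_k] E[\beta_j \lambda_l] + E[\alpha_i \lambda_l] E[\beta_j \gamma_k] \} \\
 &\quad - 2 \sum_{i,j,k,l} A_{ij} G_{kl} \bar\alpha_i \bar\beta_j \bar\gamma_k \bar\lambda_l
\end{aligned} \tag{D.14}$$

The lemma follows directly.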

Q.E.D. 

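Now to derive the covariances of the relative likelihoods $L'_m$ under each
hypothesis. Using lemma F:

$$\begin{aligned}
\mathrm{Cov}[L'_p, L'_q \mid H_k] &= E[L'_p L'_q \mid H_k] - E[L'_p \mid H_k] E[L'_q \mid H_k] \\
 &= 2\, \mathrm{tr}\{ V^{-1} W_p R_{pp}^{-1} W_p^T V^{-1} E[s s^T \mid H_k] V^{-1} W_q R_{qq}^{-1} W_q^T V^{-1} E[s s^T \mid H_k] \} \\
 &\quad - 2 (\bar\phi_k^T R_{kp} R_{pp}^{-1} R_{pk} \bar\phi_k)(\bar\phi_k^T R_{kq} R_{qq}^{-1} R_{qk} \bar\phi_k)
\end{aligned} \tag{D.15}$$

But under hypothesis $H_0$,

$$E[s s^T \mid H_0] = V \quad \text{and} \quad \bar\phi_k = 0 \tag{D.16}$$

Substituting into (D.15):

$$\mathrm{Cov}(L'_p, L'_q \mid H_0) = 2\, \mathrm{tr}\{ R_{qp} R_{pp}^{-1} R_{pq} R_{qq}^{-1} \} \tag{D.17}$$

Under hypothesis $H_k$ for $k = 1, \ldots, N+C$, equation (D.5) implies that:

$$E[s s^T \mid H_k] = W_k \bar\phi_k \bar\phi_k^T W_k^T + V - W_k \Delta\Phi_k W_k^T \tag{D.18}$$

Therefore:

$$\begin{aligned}
\mathrm{Cov}(L'_p, L'_q \mid H_k)
 &= 2\, \mathrm{tr}\{ R_{pp}^{-1} [ R_{pk} (\bar\phi_k \bar\phi_k^T - \Delta\Phi_k) R_{kq} + R_{pq} ] R_{qq}^{-1} [ R_{qk} (\bar\phi_k \bar\phi_k^T - \Delta\Phi_k) R_{kp} + R_{qp} ] \} \\
 &\quad - 2 (\bar\phi_k^T R_{kp} R_{pp}^{-1} R_{pk} \bar\phi_k)(\bar\phi_k^T R_{kq} R_{qq}^{-1} R_{qk} \bar\phi_k)
\end{aligned} \tag{D.19}$$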


The  formulas  for  the  white  noise  case  are  found,  by  substituting: 


Appendix  E 

Mean  and  Variance  of 
Processor  Output  Estimates 


First  we  compute  the  expected  value  of  the  estimates  of  the  processor  out¬ 
puts  under  each  hypothesis.  Under  Hq: 


$$\begin{aligned}
E[\hat y \mid H_0] &= E[y \mid H_0] + \Phi_y W^T V^{-1} E[s \mid H_0] \\
                   &= \bar y + 0 \\
                   &= \bar y
\end{aligned} \tag{E.1}$$

Under  Hk  for  k  =  taking  the  expectation  of  (C.25): 

=  (y  +  iki,)  -  iki,  + 

=  y  (E.2) 

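Under hypothesis $H_0$, we calculate the variance of $\hat y$ as follows. Rewriting
(C.13):

$$\hat y = \begin{pmatrix} I & \Phi_y W^T V^{-1} \end{pmatrix} \begin{pmatrix} y \\ s \end{pmatrix} \tag{E.3}$$

From the Gaussian formula in (C.2):

$$\begin{aligned}
\mathrm{Var}(\hat y \mid H_0) &= \begin{pmatrix} I & \Phi_y W^T V^{-1} \end{pmatrix}
 \begin{pmatrix} \Phi_y & -\Phi_y W^T \\ -W \Phi_y & V \end{pmatrix}
 \begin{pmatrix} I \\ V^{-1} W \Phi_y \end{pmatrix} \\
 &= \Phi_y - \Phi_y W^T V^{-1} W \Phi_y
\end{aligned} \tag{E.4}$$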
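To make the size of the variance reduction in (E.4) concrete, consider a hypothetical system with $N = 2$ working processors, $C = 1$ checksum processor, scalar outputs, weights $W = (1 \;\; 1)$, and white noise of variance $\sigma^2$ on every processor. These values are assumptions chosen for illustration.

```latex
\Phi_y = \sigma^2 I_2, \quad \Phi_s = \sigma^2, \quad
V = W \Phi_y W^T + \Phi_s = 3\sigma^2
\\[4pt]
\mathrm{Var}(\hat y \mid H_0)
= \sigma^2 I_2 - \frac{\sigma^4}{3\sigma^2} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
= \sigma^2 \begin{pmatrix} \tfrac{2}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{2}{3} \end{pmatrix}
```

In this example each output's noise variance falls from $\sigma^2$ to $\tfrac{2}{3}\sigma^2$, at the cost of making the output errors correlated.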

Under  hypothesis  Hk  for  /:  =  1,  +  C,  each  component  of  y  has  value: 


y^  +  $„,W^V-i(s  + Wfcjj  form  A: 

U.m  ~  for  m  =  A; 

_  i  y^  +  ^mWj^lPifcS  for  m  7^  A: 

“  1  |/fc  + (W[V-*Wfc)-‘W[V-‘5  form  =  fc 

where: 

p*  =  I  -  w,(w[v-*w*)-'w[v-» 

is  a  projection  operator  which  is  orthogonal  to  W*;  thus  PtWjt  =  0.  The 
following  subcomputations  axe  useful  in  evaluating  the  covariance: 

=  ^mOmp  for  k,p^  k  (E.7) 


(E.5) 

(E.6) 


Cov(s,y^|/ffc,0j 


J=1 
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(E.8) 


(orm^k 
-  for  m  =  A: 


Coy{s,s\Hk,l,)  =  V-WfcA$tW^  (E.9) 

Using  these  results,  we  can  derive  the  covariances  of  For  m  ^  k,p  ^  k: 

Cov(t^,tJiffc,^J  =  $„a„,p-2$„W^V-‘PfcWp$p 

+$,„Wj;;V-iPi(V  -  WtA$fcWr)PrV-*Wp$p 
=  -  $„,w;;;v->p*Wp<^p 

=  ^mOmp  -  ^mlRmp  -  Rmk^lk^kp]^p  (E.IO) 

For  m  ^  k: 

Cov(g^,  =  ^mO'mp  -  ^mW^V"*PikWfc$fc 

-^n.W^V-^W^WfV-iWfc)"* 

+^„,WlY-^Pk{y  -  WfcA$tWr)V-iWt(WrV-iWfc)‘' 

=  (E.ll) 

Finally: 

-$fcWj’V-*Wfc(Wj’V->W*)-* 

+  (Wj'V-‘ Wfc)"'W[V-»(V  -  WfcA$tWj’)V-*Wfc(W^V-*W*; 
=  -$*  +  (Wj'V-»W0-'- A<&* 

=  R-ki-^k  .  (E.- 
Appendix F

Proof of $WW^T = I$ For Complete
Sets  of  Weight  Vectors 


Let us construct the weight vectors in the following manner. Let each
weight $w_{m,k}$ be a small integer in the symmetrical range $-L$ to $+L$. Form
all possible $(2L+1)^C$ vectors of this type. Now eliminate from this set the
zero vector, and any vector which is an integral multiple of another vector.
Now note that:

$$WW^T = \sum_k W_k W_k^T \tag{F.1}$$

where the $W_k$ are the individual weight vectors. Examine the $(i,j)$ block
component of this matrix for any $i \neq j$. Suppose there is some weight
vector $W_k$ which has weight values $\alpha_i I$ and $\beta_j I$ in the $i$-th and $j$-th positions.
By symmetry, there must be another weight vector $W_m$ which either has
weight values $\alpha_i I$ and $-\beta_j I$, or else $-\alpha_i I$ and $\beta_j I$ in these positions. (This
is just the definition of a complete set of weight vectors.) Then the $(i,j)$
element of $W_k W_k^T + W_m W_m^T$ is $(\alpha_i \beta_j - \alpha_i \beta_j) I = 0$. Since the weight vectors
can all be grouped into pairs in this manner, the $(i,j)$ element of $WW^T$
must be zero for $i \neq j$.

The  same  proof  also  holds  if  the  weights  are  chosen  to  be  complex 
integers  chosen  from  a  symmetrical  range. 
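The construction can be tried out directly for small cases. The sketch below assumes scalar weights and interprets "integral multiple of another vector" as keeping only vectors whose components have greatest common divisor 1 and whose first non-zero component is positive (so that one vector of each sign pair survives); that interpretation and the function names are assumptions.

```python
from functools import reduce
from itertools import product
from math import gcd

def complete_weight_vectors(num_checks, max_weight):
    """Integer vectors in [-L, L]^C, keeping one representative per direction."""
    vectors = []
    for v in product(range(-max_weight, max_weight + 1), repeat=num_checks):
        if not any(v):
            continue                                  # drop the zero vector
        if reduce(gcd, (abs(c) for c in v)) != 1:
            continue                                  # multiple of a shorter vector
        if next(c for c in v if c != 0) < 0:
            continue                                  # negative of a kept vector
        vectors.append(v)
    return vectors

def gram(vectors):
    """Sum over all vectors of the outer product w w^T."""
    n = len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) for j in range(n)]
            for i in range(n)]

small = complete_weight_vectors(num_checks=2, max_weight=1)
print(small)          # four vectors: (0, 1), (1, -1), (1, 0), (1, 1)
print(gram(small))    # off-diagonal entries cancel in pairs
```

In these small cases the sum of outer products comes out as a multiple of the identity, so a normalization of the weights would be needed to obtain the identity itself.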